PREFACE.

IT is the aim of this book to present certain topics of 
elementary statistical theory which have been found useful 
and workable. 

The statement would seem warranted that no more than 
the very simplest methods should be used by one who has no 
knowledge of the principles underlying the methods. Busy 
though the scientist may be, he owes it to the science and to 
the persons who may accept his results to have some familiarity 
with his tools. The blind application of formulas in statistics 
has been made possible by the convenient manuals that have ap- 
peared and has been encouraged by the fact that the theory has 
been so surrounded by intricate and involved mathematics that it 
was only by an extended research that a knowledge of the theory 
could be obtained. 

There is no real reason why the theory of statistical methods 
should remain in obscurity. The necessary mathematics is 
largely elementary arithmetic and except in a few cases there is 
no need for higher mathematics. This book presupposes a 
reasonable familiarity with elementary mathematics only. 

Because of the desire to eliminate higher mathematics from 
the body of the book the discussion of the theory of the Gen- 
eralized Frequency Curves of Pearson has been deferred to Ap- 
pendix I. For the same reason a discussion of the promising 
method of variate differences is omitted, as is the mathematical 
theory of random selection. 

While it is hoped that the statistical data of this book may 
be of interest in themselves they have been selected solely with 
reference to their usefulness in illustrating the theory. For this 
reason all examples and exercises have to do with very simple 
data. The author will appreciate notice of such numerical and 
other inaccuracies as may be found. 

The idea is emphasized that a formula or method to be of 
practical and trustworthy value to a statistician must be so 
simple and direct that the final results can be interpreted in terms 
of the original conditions or the given data. To illustrate, if the
arithmetic mean is ten per cent larger in one distribution than in
another what difference does this variation indicate in the forms 
of the distributions or in the values of the two series of measure- 
ments? If one correlation ratio is 0.54 and a second 0.59 how 
much more closely related are the attributes in the second than 
in the first? It must always be remembered that mathematics is 
but a tool to be used when the desired results can be more 
efficiently attained by its use, and that a formula is nothing more 
than a statement in mathematical language of a method of com- 
putation already thought out and understood. The difficulties 
that may arise in this subject are not primarily mathematical. 
They are essentially a part of the necessarily difficult task of 
analyzing a statistical distribution. 

The preparation of a book on mathematical statistics to 
appeal to scientific workers in fields ordinarily considered to 
be non-mathematical is essentially a matter of experimentation. 
It is the hope of the author that this book may stimulate interest 
in the methods of presenting statistical theory and in the more 
inclusive problem of making mathematical theory more widely 
available. Any suggestions or criticism of this presentation will 
be appreciated. 

The Bibliography of Appendix II is inserted as a guide to 
advanced reading in the subject of mathematical statistics; the 
contributions of Prof. Pearson are to be noted especially. 

It seems hardly necessary to refer to the debt which anyone
who works in statistical theory must owe to Professor
Karl Pearson. Because of his "Tables for Statisticians and
Biometricians" the formulas of Appendix I are not given in 
more detail. 

Professor James McMahon has given most generously of 
his time and interest. Whatever assistance this book may afford 
to the practical worker in statistics is in a large measure due to 
the influence of Professor Walter F. Willcox, whose critical 
insight into the limitations and the possibilities of statistical 
methods together with the originality and practical initiative 
which permeate his research and instructional work place all 
his students under obligations to him. 
CURVE PLOTTING.

Plotting the Data. Let us plot the following data* of
the monthly precipitation at Columbus for the year 1916:

January,   5.0 inches     May,     4.8 inches     September, 1.5 inches
February,  1.5 inches     June,    3.5 inches     October,   1.8 inches
March,     4.9 inches     July,    0.7 inches     November,  1.6 inches
April,     2.3 inches     August,  3.2 inches     December,  3.6 inches

A horizontal straight line is first drawn and at equal distances 
on this line twelve points are located, one for each month. On a 
vertical line erected at the point corresponding to the month of 
January equal intervals are laid off, one for each inch of 
precipitation; and these intervals are subdivided into tenths. The
two series of points are called the scales. It is usual to designate
the horizontal and the vertical scale lines by O-X and
O-Y respectively, as in Figure I.
Fig. I. Monthly Precipitation at the Columbus Station for the year 1916.

The January precipitation is 5.0 inches. Place a dot above
the January, or beginning point, at a height corresponding to 5.0
inches on the vertical scale. The next point is directly above the
second or February point at a distance corresponding to 1.5
inches. Continuing in this way we locate a point for each month;
the data is then said to be plotted or pictured point by point.

* Annual Meteorological Summary, U. S. Weather Bureau, Columbus, Ohio. 1917.
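The point-by-point procedure may be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names PRECIPITATION and plot_points are hypothetical, the horizontal position is counted in months from January as zero, and the height is also expressed in tenths of an inch to match the subdivision of the vertical scale.

```python
# Monthly precipitation at Columbus for 1916, in inches (from the table above).
PRECIPITATION = [
    ("January", 5.0), ("February", 1.5), ("March", 4.9), ("April", 2.3),
    ("May", 4.8), ("June", 3.5), ("July", 0.7), ("August", 3.2),
    ("September", 1.5), ("October", 1.8), ("November", 1.6), ("December", 3.6),
]


def plot_points(data):
    """Return one (x, inches, tenths) triple for each month.

    x      : position on the horizontal scale, January = 0 (an assumption)
    inches : height on the vertical scale
    tenths : the same height counted in tenths of an inch
    """
    return [(x, inches, round(inches * 10))
            for x, (_, inches) in enumerate(data)]


POINTS = plot_points(PRECIPITATION)

for (month, _), (x, inches, tenths) in zip(PRECIPITATION, POINTS):
    print(f"{month:<10} x = {x:2d}   height = {inches} in. ({tenths} tenths)")
```

Each triple gives the place of one dot on the diagram; connecting the dots is taken up later in the chapter.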
Exercises.

1. Plot the following March precipitation data.*

1879 .... ?.?     1889 .... 0.7     1899 .... 4.7     1909 .... 2.7
1880 .... 2.4     1890 .... 5.6     1900 .... 2.6     1910 .... 0.3
1881 .... 4.0     1891 .... 4.6     1901 .... 1.8     1911 .... 2.4
1882 .... 4.8     1892 .... 2.2     1902 .... 2.6     1912 .... 4.?
1883 .... 3.2     1893 .... 1.9     1903 .... 4.1     1913 .... 8.1
1884 .... 3.6     1894 .... 1.8     1904 .... 4.9     1914 .... 2.5
1885 .... 0.5     1895 .... 1.2     1905 .... 1.9     1915 .... 1.2
1886 .... 3.9     1896 .... 3.0     1906 .... 4.6     1916 .... 4.9
1887 .... 2.6     1897 .... 5.5     1907 .... 5.2
1888 .... 3.8     1898 .... 7.0     1908 .... 6.0







2. Plot the following population data for the United States:

1790 ....  3,929,214     1840 .... 17,069,453     1890 .... 62,947,714
1800 ....  5,308,483     1850 .... 23,191,876     1900 .... 75,994,575
1810 ....  7,239,881     1860 .... 31,443,321     1910 .... 91,872,266
1820 ....  9,638,453     1870 .... 38,558,371
1830 .... 12,866,020     1880 .... 50,155,783

In plotting this data take the numbers to the nearest million.
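The cutting back of the census figures to the nearest million may be sketched as follows. The names CENSUS and to_nearest_million are hypothetical, and whole-number arithmetic is used merely as a convenience.

```python
# Census population of the United States, 1790-1910 (from Exercise 2).
CENSUS = {
    1790: 3_929_214, 1800: 5_308_483, 1810: 7_239_881, 1820: 9_638_453,
    1830: 12_866_020, 1840: 17_069_453, 1850: 23_191_876, 1860: 31_443_321,
    1870: 38_558_371, 1880: 50_155_783, 1890: 62_947_714, 1900: 75_994_575,
    1910: 91_872_266,
}


def to_nearest_million(population):
    """Round a whole-number population to the nearest million."""
    return (population + 500_000) // 1_000_000


# Values actually plotted, in millions.
ROUNDED = {year: to_nearest_million(pop) for year, pop in CENSUS.items()}

for year, millions in ROUNDED.items():
    print(year, millions)
```

The rounded values, not the full census figures, are the heights laid off on the vertical scale.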

General Directions for the Laying off of Scales. The
object of any graphic representation of statistical data is to pre- 
sent a vivid picture and therefore a diagram too small or too 
large, or too wide or too narrow will not accomplish this purpose 
as efficiently as will a correctly proportioned diagram. This 
means that the widths of the horizontal and the vertical scale 
intervals must be carefully chosen in order to give the complete 
diagram the proper proportions. 

In determining the widths of the intervals account must be 
taken of the nature of the statistical material. If the data is so 
inaccurate, for instance, that the measurements can be determined 
only to the nearest million it would be improper to divide the 
scale into intervals corresponding to thousands. The wealth of 
the country and the value of manufactured articles are examples 
of statistics which do not admit of close subdivision. 

It is useless to have the scale intervals finer than the smallest
difference which the eye can conveniently distinguish on the diagram.
This often means, even in the case of quite accurate material,
that the figures of the data must be cut back; for instance
in plotting population data for the United States one million
may be the smallest numerical difference that can be pictured
on an ordinary sized diagram.

* Annual Meteorological Summary, U. S. Weather Bureau, Columbus, Ohio. 1917.

Usually, as in Figure II, horizontal and vertical lines, called 
coordinate lines, are drawn to assist in carrying the divisions 
of the scales across the diagram. Care must be taken that these 
lines are lightly drawn and are not more numerous than is neces- 
sary. 

Connecting the Points. The eye is assisted in passing 
across a diagram if the plotted points are connected by a curve. 
The curve may be either a series of broken straight lines joining 
the points or a continuous curve passing thru each point without 
sharp angles or abrupt changes in direction. Of the two methods 
the continuous curve is usually to be preferred because of the 
better appearance which it presents. In Figure II the points are 
connected by straight lines and in Figure III a continuous curve 
is drawn. 

Exercises. 

3. Plot the curve of the 1916 rainfall at Columbus from the data 



Fig. II. The Plotted Points of Monthly Temperatures connected by straight lines.

Fig. III. The Points of Fig. II connected by a continuous curve.

4. Plot the population curve from the data of Exercise 2. 
Directions for Plotting Curves.

1. The general arrangement of a diagram should be from left 
to right and from bottom to top. 

2. Figures for the scales of a diagram should ordinarily be placed 
at the left and along the bottom. 

3. Whenever practicable, the vertical scale should be so chosen 
that the zero line will appear on the diagram. When this is not done 
it is well to indicate that fact by a break in the diagram. 

4. The zero lines must be sharply distinguished from the other 
coordinate lines of the diagram. 

5. The curve must be carefully distinguished from the coordinate 
lines. 

6. The data should accompany the diagram either in the form 
of a tabular statement or placed directly on the diagram. The latter 
method of presenting the original data can sometimes be effectively used, 
especially when the number of items is not large. 

Underlying all rules for the construction of statistical dia- 
grams is the general direction: The diagram must be so ar- 
ranged as to present the data most effectively. Because of the 
great diversity of statistical material and of the wide variety of 
purposes for which data may be collected and presented it is 
not possible to lay down specific rules which are to be followed 
in every case. Whenever the vividness and accuracy of the statistical
picture are not sacrificed by so doing, the conventional and
generally accepted ways should be followed.









Exercises.

5. Plot the following data of annual precipitation.*

1879 .... 31.?     1889 .... 28.5     1899 .... 28.5     1909 .... 36.6
1880 .... 44.7     1890 .... 50.7     1900 .... 30.3     1910 .... 34.8
1881 .... 47       1891 .... 42.1     1901 .... 26.5     1911 .... 43.4
1882 .... 51.3     1892 .... 33.5     1902 .... 34.2     1912 .... 29.6
1883 .... 48.9     1893 .... 38.1     1903 .... 28.1     1913 .... 40.9
1884 .... 31.0     1894 .... 29.5     1904 .... 31.5     1914 .... 31.2
1885 .... 43.3     1895 .... 30.7     1905 .... 35.1     1915 .... 39.9
1886 .... 42.4     1896 .... 40.5     1906 .... 33.7     1916 .... 34.4
1887 .... 30.3     1897 .... 41.2     1907 .... 37.6
1888 .... 35.1     1898 .... 41.3     1908 .... 30.1

Since the lowest number of inches is 26.5 it is better to make a
break in the vertical scale, starting the working scale with, say, 25 inches.



* Report of Columbus Station, U. S. Weather Bureau. 
6. Plot the following data of mean monthly temperatures.*

1879 .... 53.1     1889 .... 52.2     1899 .... 53.2     1909 .... 52.0
1880 .... 53.6     1890 .... 53.2     1900 .... 53.8     1910 .... 51.7
1881 .... 54.2     1891 .... 52.6     1901 .... 51.8     1911 .... 53.8
1882 .... 53.4     1892 .... 51.3     1902 .... 52.1     1912 .... 50.8
1883 .... 51.8     1893 .... 51.2     1903 .... 52.0     1913 .... 53.5
1884 .... 52.5     1894 .... 53.3     1904 .... 50.2     1914 .... 52.0
1885 .... 49.1     1895 .... 51.6     1905 .... 51.5     1915 .... 51.8
1886 .... 50.3     1896 .... 53.0     1906 .... 52.7     1916 .... 52.0
1887 .... 52.5     1897 .... 52.9     1907 .... 50.8
1888 .... 51.0     1898 .... 53.6     1908 .... 53.5







7. Plot the curve of Top Beef Cattle Prices from the following
data:**

1891 ....  7.15     1898 ....  6.25     1905 ....  7.00     1912 .... 11.25
1892 ....  7.00     1899 ....  8.25     1906 ....  7.60     1913 .... 10.25
1893 ....  6.75     1900 ....  7.50     1907 ....  8.00     1914 .... 11.40
1894 ....  6.40     1901 ....  8.00     1908 ....  8.40     1915 .... 11.60
1895 ....  6.60     1902 ....  9.00     1909 ....  9.50     1916 .... 13.00
1896 ....  6.50     1903 ....  6.85     1910 ....  8.85
1897 ....  6.00     1904 ....  7.65     1911 ....  9.35







8. From the data of page 25 plot the 1916 beef cattle prices. 

9. From the data of page 25 plot the 1895 beef cattle prices. 



The Title of a Diagram. Each diagram must be pro- 
vided with a brief and concise and yet accurate and comprehen- 
sive title. The title must cover all of the data and not merely a 
certain section of it and it must do this without being of undue 
length. A careful study of examples of titles is especially help- 
ful in acquiring a notion of what constitutes a proper title. 

All headings of columns must be clear and definite. The
units of measurement of a scale must always be given; thus,
"Precipitation in inches," "Temperature in degrees".

Titles and headings have a better appearance when made 
in Roman characters than when made in script. In general the 
size of type in each heading or sub-title should correspond in size 
and prominence to its respective importance. Unless the letter- 
ing is skilfully done by hand it is better to use a typewriter even 
tho different sizes of letters cannot be secured by its use. 



* Report of Columbus Station, U. S. Weather Bureau. 
** Chicago Live Stock World, January 2, 1917.
Exercises.

10. Study the titles and headings of the diagrams and tables of 
Vol. V, Report of the United States Census, 1910. 

11. Study the titles shown in "Graphic Methods for Presenting
Facts" by Willard C. Brinton.†

12. Study the titles and headings of the current issue of the 
Monthly Crop Reporter, Department of Agriculture. 

In each of the following exercises construct a complete statistical 
diagram with the curve carefully drawn and an appropriate title de- 
signed for each. 

13. The Land area of the United States exclusive of outlying 
possessions from Table 18, Vol. I, Report of the United States Census, 
1910. 

14. The population of Ohio from Table 10, same report. 

15. Comparative Values of Inside Lots of Different Depths ac- 
cording to the Lindsay-Bernard system of valuation. 



The Lindsay-Bernard and Somers Valuation Schedule.*

          Lindsay-                          Lindsay-
Depth.    Bernard.    Somers.     Depth.    Bernard.    Somers.
   5.      $9         $14.35        85.     $82.0       $93.33
  10.      15          25.00        90.      84.2        95.60
  15.      21          32.22        95.      86.2        97.85
  20.      27          41.00       100.      88         100.00
  25.      33          47.90       105.      89.6       102.08
  30.      38.5        54.00       110.      91.1       104.00
  35.      44          59.20       115.      92.5       105.78
  40.      49          64.00       120.      93.8       107.50
  45.      54          68.45       125.      95         109.50
  50.      58.5        72.50       130.      96.1       110.50
  55.      63          76.20       135.      97.2       111.80
  60.      67          79.50       140.      98.2       113.00
  65.      70.6        82.61       145.      99.2       114.50
  70.      73.9        85.60       150.     100         115.00
  75.      76.9        88.30       175.     103         119.14
  80.      79.6        90.90       200.     105         122.00



16. Comparative Values of Inside Lots of Different Depths ac- 
cording to the Somers system. 

17. The accumulated value of $1 at 10% compound interest: 

Year.        1     2     3     4     5     6     7     8     9    10

Amount ..  1.00  1.10  1.21  1.33  1.46  1.61  1.77  1.95  2.14  2.36



† The Engineering Magazine, 1915, N. Y.
* The National Real Estate Journal, May, 1914.
18. The Average Yield per Acre for Wheat in the United States
since 1866; Yearbook, Dep't of Agriculture. 

19. Average Farm Price per bushel of Wheat in the United 
States since 1866; Yearbook, Dep't of Agriculture. 

20. Per cent of Wheat Crop Exported since 1866; Yearbook, Dep't 
of Agriculture. 

21. Total Production of Wheat to nearest 10 million bushels in
the United States since 1866; Yearbook, Dep't of Agriculture.

22. Substitute the word Corn for Wheat in Exercises 18 to 21 and
construct the curves.

23. Bank Clearings of the United States, excluding N. Y. 

Bank Clearings of U. S. excluding N. Y. (in millions).

1883 .... $14,209     1894 .... $21,298     1905 .... $50,087
1884 ....  12,919     1895 ....  23,507     1906 ....  55,327
1885 ....  13,170     1896 ....  22,304     1907 ....  57,994
1886 ....  15,513     1897 ....  23,895     1908 ....  53,133
1887 ....  17,566     1898 ....  26,959     1909 ....  62,249
1888 ....  18,397     1899 ....  33,416     1910 ....  66,821
1889 ....  20,280     1900 ....  33,771     1911 ....  67,857
1890 ....  23,370     1901 ....  39,152     1912 ....  73,209
1891 ....  23,198     1902 ....  41,695     1913 ....  75,181
1892 ....  25,660     1903 ....  43,239     1914 ....  72,225
1893 ....  23,049     1904 ....  43,972



24. Per capita Imports of U. S.:

1860 .... $11.25   1871 .... $14.47   1882 .... $14.36   1893 .... $11.68   1904 .... $12.71
1861 ....   9.02   1872 ....  16.15   1883 ....  12.81   1894 ....   9.97   1905 ....  14.24
1862 ....   5.79   1873 ....  14.27   1884 ....  11.48   1895 ....  11.60   1906 ....  15.69
1863 ....   7.29   1874 ....  13.13   1885 ....  10.49   1896 ....   9.66   1907 ....  16.29
1864 ....   9.30   1875 ....  11.43   1886 ....  11.57   1897 ....  10.32   1908 ....  12.54
1865 ....   6.87   1876 ....   9.47   1887 ....  12.09   1898 ....   8.66   1909 ....  16.28
1866 ....  12.26   1877 ....  10.37   1888 ....  12.11   1899 ....  10.68   1910 ....  10.94
1867 ....  10.23   1878 ....   9.07   1889 ....  12.58   1900 ....  10.86   1911 ....  16.32
1868 ....   9.94   1879 ....  10.52   1890 ....  13.15   1901 ....  11.34   1912 ....  19.04
1869 ....  11.60   1880 ....  13.88   1891 ....  12.96   1902 ....  12.30   1913 ....  18.47
1870 ....  11.97   1881 ....  13.06   1892 ....  12.91   1903 ....  12.42   1914 ....  18.14



Note that the data of the two preceding exercises shows a de- 
cided periodicity or wave-like nature. 
More than One Curve on the Same Diagram. For the
purpose of comparing different curves it is often convenient
to plot two or more curves on the same diagram. For instance, 
simultaneous variations in the prices of wheat and corn can be 
observed to good advantage, when the two curves are brot 
together on the same diagram and constructed to the same scales. 
The chief disadvantage of this method of comparing curves lies 
in the resulting complexity of the diagram. If the diagrams 
are constructed on thin paper and the lettering and curves are 
made heavy the different curves when made on separate sheets 
can be readily compared by adjusting one sheet of paper above 
the other. 

Exercises. 

25. Compare the rainfall curve of Exercise 5 with the temperature 
curve of Exercise 6. To what extent do the two curves vary in the 
same directions? What conclusions can be drawn as to the tendency 
for the amount of rainfall to depend on the temperature? 

26. Compare the two systems of real estate valuation of Exer- 
cises 15 and 16. 

27. Give a comparative interpretation of the curves of Exercises 
18 and 19. Why should they not be expected to follow exactly the 
same general course? 

28. Discuss as in Exercise 27 curves of prices and yield per acre 
of corn. 

29. Compare the curves of Exercises 21 and 23. 

Coordinates. It is convenient to have a standardized 
notation for the horizontal and vertical scales. The horizon- 
tal line is denoted by O-X and called the axis of abscissas 
or simply the X-axis. The vertical line is denoted by O-Y
and called the axis of ordinates or the Y-axis. The point 
where the two lines meet is the origin of coordinates. Dis- 
tances along the X-axis are spoken of as x distances or x 
coordinates, and those along the Y-axis as y distances, or 
y coordinates. Thus in the precipitation data of page 7, the 
origin is at January, 1879, and the values of X differ by intervals 
of one month, while the unit interval for Y is one inch. 

Logarithmic Curves. Whenever the data seems to ex- 
hibit a uniform rate of increase or whenever it is desired to 
study the relative changes rather than the actual changes in the 
data, a logarithmic curve may be of service.* A logarithmic
curve is obtained by taking the logarithms of the measurements 
and using these logarithms as vertical distances or ordinates. 
Since multiplying two numbers adds their logarithms, a constant
ratio or rate will appear in the logarithmic diagram as a constant
addition. Hence if there is a constant rate in the data
the logarithmic curve will be a straight line. Whether the rate 
is constant or not, curves of this type are of value for com- 
paring different rates. However, if the rate is not approximately 
constant considerable familiarity with logarithms is necessary 
if the comparative differences are to be correctly interpreted. 
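The straight-line property may be checked numerically with the compound interest figures of Exercise 17. The sketch below is illustrative only: the names AMOUNTS, LOGS and STEPS are hypothetical, common (base 10) logarithms are an arbitrary choice, and since the table is rounded to two decimals the successive additions are only approximately equal.

```python
import math

# Accumulated value of $1 at 10% compound interest (table of Exercise 17).
AMOUNTS = [1.00, 1.10, 1.21, 1.33, 1.46, 1.61, 1.77, 1.95, 2.14, 2.36]

# Ordinates of the logarithmic curve: common logarithms of the amounts.
LOGS = [math.log10(a) for a in AMOUNTS]

# Successive additions along the logarithmic curve. With a constant rate
# these should all be nearly log10(1.10), so that the points lie very
# nearly on a straight line.
STEPS = [later - earlier for earlier, later in zip(LOGS, LOGS[1:])]

for year, (log_value, step) in enumerate(zip(LOGS[1:], STEPS), start=2):
    print(f"year {year:2d}: log = {log_value:.4f}, addition = {step:.4f}")

print(f"log10(1.10) = {math.log10(1.10):.4f}")
```

The additions all fall close to 0.0414, the logarithm of 1.10; the small differences among them arise from the rounding of the table.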

Exercises. 

30. Plot the logarithmic curve of the data of Exercise 17. 

31. Plot the logarithmic curves of the data of Exercises 15 and 16. 

32. Plot the logarithmic curve of the Chicago Top Beef Cattle 
Prices. 

Cumulative Curves. All the preceding curves show the 
respective values for each interval of the horizontal axis, as 
the production of wheat for each year since 1866 is shown by 
the curve of Exercise 21. Now if it is desired to construct a 
curve exhibiting at each year the total production of wheat since 
1866, the amount of each year's production is added to that 
of all the preceding years and the resultant cumulative sums 
plotted. In this way a curve is obtained which starts at the 
lower left hand corner and proceeds in a diagonal direction 
across the diagram. It is called a cumulative curve. The 
values to be plotted will be, in the case of the cumulative curve 
of wheat production, 150,000,000, 360,000,000, 580,000,000, 
840,000,000, etc. 
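The forming of cumulative sums may be sketched as follows. The yearly figures are not quoted in the text; they are obtained here by differencing the cumulative values just given and are therefore an assumption of this illustration, as are the names YEARLY and CUMULATIVE.

```python
from itertools import accumulate

# Yearly wheat production in millions of bushels for the first four
# years. These are assumed values, inferred by differencing the
# cumulative figures 150, 360, 580, 840 quoted in the text.
YEARLY = [150, 210, 220, 260]

# Each cumulative value is the sum of that year's production and the
# production of all preceding years.
CUMULATIVE = list(accumulate(YEARLY))

print(CUMULATIVE)
```

Since each year adds a quantity that cannot be negative, the cumulative values never decrease, which is why the curve proceeds upward across the diagram.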

Exercises. 

33. Plot the cumulative curve of wheat production. 

34. Plot the cumulative curve of corn production and compare with 
the curve of Exercise 33. 

35. Of what significance is the slope of a cumulative curve? 



* See "The Ratio Curve," Fisher. Quarterly Publications American
Statistical Association, June, 1917.



CHAPTER II. 
CURVE PLOTTING. (Continued.)

Interpolation. The curves of the preceding chapter were
drawn for the purpose of connecting the plotted points in order 
to assist the eye in following the course of the data across the 
diagram. However, other uses can be made of a statistical 
curve. 

At the beginning of Chapter I the data of monthly pre- 
cipitation is given. What was the weekly precipitation? The 
Chicago Top Beef Cattle monthly prices are given under 
Exercise 7, Chapter I. What were the weekly prices during 
the period covered by that data? The population of the 
United States is given for ten-year intervals. What has been 
the population from year to year? These are essentially 
questions of interpolation, that is, of estimating values lying 
between the given values. 

The method of obtaining intermediate values from the 
curve consists merely of measuring on the vertical scale the height 
of the curve at the required point. Thus with the population 
curve of Exercise 4, Chapter I, which is constructed from the 
decennial census reports, the population for the year 1906 is given 
by the height of the curve above the 1906 point on the horizontal 
scale. 
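Reading a height from the curve may be imitated by calculation when the curve between two census years is taken to be a straight line. That is only one of the two ways of connecting points described in Chapter I, and it is assumed here for simplicity; the function name interpolate is hypothetical.

```python
def interpolate(x0, y0, x1, y1, x):
    """Height at x of the straight line joining (x0, y0) and (x1, y1)."""
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


# Census populations of 1900 and 1910 (Exercise 2, Chapter I).
POP_1900 = 75_994_575
POP_1910 = 91_872_266

# Estimated population for 1906 under the straight-line assumption.
ESTIMATE_1906 = interpolate(1900, POP_1900, 1910, POP_1910, 1906)

print(f"Estimated population, 1906: {ESTIMATE_1906:,.0f}")
```

A continuous curve drawn thru the same points would give a somewhat different height at 1906; the estimate depends on how the points are connected.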

Exercises. 

1. Estimate the Top Beef Cattle Prices for each week in February 
1916, from the monthly data of Exercise 7, Chapter 1. 

2. Estimate the values of inside lots for the fraction of a foot, say 
67.5 feet, from the data of Exercise 16 of the preceding chapter. 

3. What is, according to the data of Exercise 17 of Chapter I, the 
compound amount of $1 for 7.5 years at 10%? 

This method of interpolating makes an estimated value
depend on the two consecutive given values which inclose it.
But the increase in population during a decade may have occurred
almost entirely during the last years of the period and
yet the shape of the curve when drawn merely to connect the
ten-year points may give no hint of this irregularity of increase.
The temperature for one month may have no connection with
that of the preceding month and hence the curve between the
points, depending as it does on the two non-related values, can
hardly be expected to give the actual temperature for an intermediate
week or day. If the price of wheat for the year 1905
is omitted can it be reliably estimated by drawing the curve
from the years 1904 and 1906 and then interpolating for the
missing year?

It must be apparent therefore that a curve which passes 
thru a series of more or less non-related points can be of little 
value in interpolation and that the problem of interpolation is 
essentially one of determining by some means or other the 
general course of the data and then estimating the intermediate 
values in conformity with this general trend. The values obtained
in this way are the most probable values; accidental variations
which bear no relation to the underlying tendencies can not
be so estimated; in fact such variations can not be estimated
or predicted by any means.

The Smoothing of a Curve. The curves of Chapter I, 
drawn as they are thru each point, preserve all the variations 
whether they are fundamentally essential or due merely to the 
presence of accidental influences. The curve of mean monthly 
temperatures, Exercise 6 of the preceding chapter, shows dis- 
tinct seasonal variations in temperature — higher temperatures in 
summer and lower in winter. Along with these essentially
significant changes are fluctuations apparently accidental: in
one year June is warm and in another relatively cool; sometimes
January is warmer than February and sometimes the
reverse is true.

To represent a general movement or trend the curve 
must be drawn without abrupt changes in direction and must 
sweep among the points rather than necessarily thru each 
point. Since such a smoothed curve, as it is called, depends 
on the general or collective characteristics of the data the draw- 
ing of it must be based on collective properties of the measure- 
ments. One pertinent general property has just been stated; 
namely, that the curve must be smooth, that is, not have abrupt 
changes in direction. This property expresses the statistical
assumption that the significant variations are fairly uniform 
from value to value and not capricious or arbitrary. A second 
assumption, which is presently discussed, is that certain areas 
are relatively stable and unchanging. 

Smoothing by Inspection. The smoothing of a curve may 
be based on a study of the data and made a matter of the skill and 
experience of the statistician without the assistance of definitely 
stated assumptions or properties. The curve is then said to be 
smoothed by inspection. 

In smoothing a curve the first step is to study the data 
carefully. Without such an investigation into the probable 
sources and extent of the irregularities and fluctuations one 
cannot hope to know what irregularities to smooth out and what 
to leave in. A curve cannot be reliably smoothed by a statis- 
tician who does not know the data thoroly. On the basis of 
the information gained by this study a preliminary curve should 
then be drawn freehand among the points. By successive 
erasures and redrawings the finished curve can gradually be 
arrived at. Thus a curve showing the long time movements in 
the price of wheat will pass above some points and below others 
and how much the curve should miss any point can not be deter- 
mined without a knowledge of financial conditions, yields, etc. 

The inspection method of smoothing a curve is often suf- 
ficiently accurate for all practical purposes, especially when 
done by a statistician of experience and especially when there 
is a considerable element of inaccuracy inherent in the data. 
Its disadvantage lies obviously in the fact that no two smooth- 
ings of the same curve will be exactly alike; the method is es- 
sentially tentative and personal. 

In any event a rough preliminary draft of the curve should 
be made by inspection before proceeding to apply more re- 
fined methods. 

Exercises. 

4. Smooth the illustrative data at the beginning of Chapter I. 

5. Smooth the data of the population of the United States as given
in Exercise 2, Chapter I.

6. Smooth the data of annual rainfall of Exercise 5, Chapter I. 

7. Smooth the data of Exercises 18, 19, 20 and 21 of Chapter I.
Fig. IV. The Smooth Curve of Monthly Precipitation
at Columbus, 1916. 

The Preservation of Areas. In the illustrative data at the 
beginning of Chapter I the precipitation of 4.9 inches in March 
is the total precipitation for the whole month. With a base of one 
unit, then a rectangle of height 4.9 will have an area equal to the 
total precipitation. Likewise the rectangle on the July unit as a 
base will have an area equal to 0.7, which is the July precipitation. 
The prices of Exercise 7, Chapter I, can in a similar manner be 
represented by rectangles with heights equal to the respective 
prices and with unit bases. The population data of Exercise 2 
of the same chapter may be represented by rectangles which are 
not adjacent and have nine rectangles omitted between successive 
census years. 

After the curve is smoothed each rectangle will be 
altered so as to have a curved top. The total area under the
finished curve will then be the sum of the areas of the modified
rectangles. The First Rule of Preservation of Areas is
that the curve should be so smoothed that the total area
under the resulting curve is equal to the sum of the areas of the 
original rectangles. Since, for instance, the monthly precipita- 
tion is made up of the sum of the daily precipitations it is like- 
wise reasonable to assume that the monthly sum is more 
stable than is the daily or weekly and hence we have the 
Second Rule of the Preservation of Areas; namely, where 
possible, the areas of the individual rectangles are to remain 
unchanged. This can be done by adding to and subtracting from 
each rectangle an equal sum. 

Within the requirement that the curve must be free from 
abrupt changes in direction the two preceding working rules 
furnish a fairly comprehensive basis for the smoothing of 
statistical data. In later chapters more detailed rules will be dis- 
cussed and applied. However, for most data the present rules 
are sufficient. 

As explained for the precipitation data a definite statistical 
meaning can usually be found for the rectangles. Even when a 
significance is with difficulty ascribed to the rectangles they should 
be drawn and the same rules applied to the smoothing as before. 
The method is in such cases justified wholly by its practical con- 
venience. 

In the illustrative plotting, at the beginning of Chapter I, 
of the data of monthly precipitation at Columbus for the year 
1916, the vertical scale was laid off on a line thru the January 
point. In constructing the rectangles for smoothing, it is con- 
venient to have the January and other perpendiculars at the 
middle of the respective intervals in order that there may be a 
half unit's space at the left of the beginning point. The zero 
point on the horizontal scale is then at the beginning of the first 
interval and the vertical distance for the first point is taken not 
on the vertical scale line but perpendicularly above the mid-point 
of the interval. Whenever the curve is to be smoothed the scale 
is marked off in this way; ordinarily the method of Chapter I is 
employed where the curve is not to be smoothed. 

The following diagram illustrates the application of the 
rectangle method of smoothing to the monthly precipitation data. 



Fig. V. The Rectangle Method of Smoothing the 
Monthly Precipitation Data for Columbus in 1916. 

Exercises. 

8. Construct the smoothed curve of prices from the data of Exer- 
cise 7, Chapter I. 

9. Do the same for the data of Exercise 5 and of Exercise 6, of 
the same chapter. 

10. Do the same for the data of Exercises 18, 19, 20 and 21 of the 
same chapter. 

11. Do the same for the data of Exercise 22 of the same chapter. 

12. Can the rules of permanence of areas be applied effectively to 
the drawing of the curve for the data of Exercise 17 of the preceding 
chapter? Why? To the data of Exercises 15 and 16 of the same 
chapter? 

13. In drawing the smooth curve of decennial census population it is 
advisable to alter the original data very slightly, if at all. Discuss. 

A common way of drawing this curve is to connect the ten year points 
by a series of straight lines and then round out the angles where the 
lines intersect. This assumes a uniform annual increase during the de- 
cade — an assumption which may or may not be true. 

14. The statistical significance of the rectangles has been discussed 
for the precipitation data. Develop the corresponding explanation for the 
decennial census data. 

15. Show that in the data of Exercises 15, 16 and 17 of the preceding 
chapter the rectangles are not significant. 

The Adjusted Data; Interpolation. Since in general it 
is impossible to preserve exactly the area of each rectangle 
the process of smoothing will lead to values differing from 
those of the original data. Consequently, the data is said to 
be adjusted or graduated or smoothed by means of the curve. 
In accordance with the reasoning at the beginning of this 
chapter the adjusted values are to be taken as giving a more 
significant idea of the true trend of the data than does the 
original data. It is evident that we have here the solution 
to the problem of interpolation. Therefore, the rule for 
interpolation is: to obtain the value at any point on the hor- 
izontal scale measure the corresponding ordinate of the smoothed 
curve, or measure the proper area under that curve. Thus the 
rainfall during the first week in June is obtained by measuring 
the area under the curve on the first one-fourth of the June base 
unit. 
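
The rule of measuring an area under the curve may be sketched numerically. The Python lines below are offered only as an illustration: in place of a hand-smoothed curve they assume straight segments joining the mid-points of the monthly rectangles, and the names curve and area are invented for the purpose. The figures are the 1916 row of the Columbus precipitation table given with the exercises of Chapter III. A curve smoothed by hand would give a somewhat different result.

```python
# Monthly precipitation at Columbus for 1916, in inches (Jan. to Dec.),
# copied from the 1916 row of the precipitation table in Chapter III.
PRECIP_1916 = [5.02, 1.47, 4.88, 2.33, 4.81, 3.49,
               0.66, 3.22, 1.54, 1.84, 1.58, 3.59]


def curve(x, values=PRECIP_1916):
    """Ordinate of the stand-in curve at horizontal position x.

    Month i (0 = January) stands on the base interval [i, i + 1] and its
    value is plotted above the mid-point i + 0.5.  ASSUMPTION: straight
    segments between mid-points take the place of the hand-smoothed curve.
    """
    t = x - 0.5                                  # distance from the first mid-point
    i = min(max(int(t), 0), len(values) - 2)     # index of the segment used
    return values[i] + (t - i) * (values[i + 1] - values[i])


def area(a, b, steps=1000):
    """Area under the stand-in curve between a and b (trapezoidal rule)."""
    h = (b - a) / steps
    total = 0.5 * (curve(a) + curve(b))
    for k in range(1, steps):
        total += curve(a + k * h)
    return total * h


JUNE = 5                                         # the June unit begins at x = 5
first_week = area(JUNE, JUNE + 0.25)             # first one-fourth of the June base
print(f"first week of June: about {first_week:.2f} inches")
```

The same function serves for any other part of the base: area(2, 3), for instance, measures the whole of the March unit under the stand-in curve.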

Test of a Graduation. The extent to which smoothing 
preserves the areas of the individual rectangles is often taken 
as a test of the appropriateness of the smoothing or gradua- 
tion. The smoothed curve is said to fit the data and the 
term "goodness of fit" is used to denote the appropriateness 
of the methods used in the process of constructing the smooth 
curve. The goodness of fit is then measured by the extent to 
which the areas of the individual rectangles are preserved. In 
applying this test two columns of numbers are set down, in one 
the original values and in the other the adjusted values. The 
differences are then taken and studied. Other conditions being 
equal the smoothing with the smallest differences is the best, tho 
the judging of goodness of fit is largely a matter of experience. 
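
The two-column test just described can be set down as a short computation. In the sketch below the original column is the 1916 monthly precipitation at Columbus, in hundredths of an inch; the adjusted column is hypothetical, invented for this illustration and not read from any curve in the book. The function names are likewise assumptions.

```python
# Original monthly values (1916 precipitation at Columbus, in hundredths
# of an inch) and a set of adjusted values.  The adjusted column is
# HYPOTHETICAL: it was invented for this sketch, not read from a figure.
ORIGINAL = [502, 147, 488, 233, 481, 349, 66, 322, 154, 184, 158, 359]
ADJUSTED = [460, 240, 420, 310, 440, 330, 140, 280, 170, 170, 190, 293]


def differences(original, adjusted):
    """Adjusted value minus original value, interval by interval."""
    return [a - o for o, a in zip(original, adjusted)]


def total_area_preserved(original, adjusted):
    """First Rule: the total area is to be unchanged by the smoothing."""
    return sum(original) == sum(adjusted)


def total_absolute_difference(original, adjusted):
    """One crude summary of how far the individual areas were altered."""
    return sum(abs(d) for d in differences(original, adjusted))


# The two columns and their differences, in inches.
for o, a, d in zip(ORIGINAL, ADJUSTED, differences(ORIGINAL, ADJUSTED)):
    print(f"{o / 100:6.2f} {a / 100:6.2f} {d / 100:+6.2f}")
print("total area preserved:", total_area_preserved(ORIGINAL, ADJUSTED))
```

Of two graduations of the same data, the one with the smaller total of absolute differences would, other conditions being equal, be preferred; the single total is only a convenience and does not replace the study of the individual differences.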

Exercises. 

16. Discuss the goodness of fit of each of the curves smoothed in 
the preceding exercises. 

17. What is the best estimate on the basis of the data of page 25 
of the Top Beef Cattle Prices for the first week in February, 1916? 

Note that in this data the rectangles have no special statistical sig- 
nificance. 

18. From the data of Exercise 23 of the preceding chapter what is 
the best estimate of the bank clearings in the United States for the first 
half of the year 1908? 

19. What is the significance of the rectangles in the case of the 
data of Exercises 14, 15, 16 of Chapter I? 

20. In drawing the curves of Exercise 19 should the values be ad- 
justed? Are these curves drawn by a process of smoothing? 

Determining the General Trend of the Data. The char- 
acteristics of a movement over a number of years can be deter- 
mined from the smoothed curve. Thus the general upward 
trend of prices during the years 1897 to 1917 is shown by the 
rise of the curve. 

Perhaps the best way to picture a general movement in the 
data is to draw a straight line, or more than one straight line 
where there seems to be more than one distinct movement, to fit 
the data. That is, to smooth the data with a straight line. With 
data not conforming closely to a straight line there is likely to 
be some uncertainty in the exact location of the straight line or 
lines but since the lines are but the pictures of the ideas of gen- 
eral increases or decreases the uncertainty is neither greater nor 
less than is the uncertainty in the ideas of the general movements 
themselves. The difficulty, in reality, is due to a lack of in- 
formation regarding the data. The methods of Chapter X are 
of much service in this connection. 
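
A straight line may also be located by computation instead of by eye. The sketch below assumes the least-squares criterion, which is one common choice and is not stated in the present chapter to be the method intended; the function name fit_line is invented. The data are the January Top Beef Cattle prices for 1897 to 1916 as given in the table of Chicago prices in Chapter III.

```python
# January Top Beef Cattle prices, 1897 to 1916, in dollars, taken from the
# table of Chicago prices in Chapter III.
YEARS = list(range(1897, 1917))
JAN_PRICES = [5.50, 5.50, 6.30, 6.60, 6.15, 7.75, 6.85, 5.90, 6.35, 6.50,
              7.30, 6.40, 7.50, 8.40, 7.10, 8.75, 9.50, 9.50, 9.70, 9.85]


def fit_line(xs, ys):
    """Slope and intercept of the least-squares line (an ASSUMED criterion)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


slope, intercept = fit_line(YEARS, JAN_PRICES)
print(f"average change: {slope:+.2f} dollars per year")
```

A positive slope corresponds to the general upward movement of the curve; where two distinct movements appear, the same computation may be applied to each part of the period separately.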

Exercises. 

21. During the last 37 years has there been an appreciable increase 
or decrease in the precipitation at the Columbus Station? 

22. During the same time has there been a decided upward or 
downward movement in temperatures at the same place? 

Periodic Data. In smoothing and determining the gen- 
eral trend of data care must be taken that the data is not 
smoothed to conform to a straight line when there is an inherent 
periodicity in the material. The data of Exercises 23 and 24 of 
Chapter I exhibit significant tendencies for the values to be high 
for a few years and then consistently lower for a few years and 
then higher, and so on, thru more or less regular and uniform 
cycles. In smoothing such data the ideal should be to determine 
a uniform cycle and then smooth the data into the curve made 
up of the determined cycles. The problem of smoothing such 
data is complicated by the fact that the curve in addition to being 
composed of a series of similar loops or arches also has a ten- 
dency to rise or fall. Thus the imports of the U. S. have in- 
creased on the whole during the last 50 years tho there have been 
increases and decreases following each other in fairly regular 
periods. 

Exercises. 

23. Smooth the data of Bank Clearings as given in Exercise 23 of 
the preceding chapter. 

24. Smooth the data of Imports as given in Exercise 24 of the pre- 
ceding chapter. 

25. To what extent has there been a tendency for bank clearings 
and for imports to increase during the period covered by the given 
data? 

26. Discuss the periods in the yield per acre of wheat in the U. S. 

27. Do the same for the production of wheat. 

28. Summarize the uses and advantages of the smooth curve as 
compared with the curve which passes exactly thru each point. 



CHAPTER III. 
FREQUENCY CURVES. 

Definitions. The following data of the measures of 
heights of 750 students* may be taken for purpose of illustration. 

The measurements are classified to show the number of in- 
dividuals for each inch of height. 

Height.   Number.     Height.   Number. 
  61          2         68        126 
  62         10         69        109 
  63         11         70         87 
  64         38         71         75 
  65         57         72         23 
  66         93         73          9 
  67        106         74          4 
                        Total     750 
Table I. 

Height, the attribute or characteristic here under con- 
sideration, is in this table measured to the nearest inch, giving 
a group or class interval of one inch. A class interval or class 
is ordinarily designated by the value of its middle measure- 
ment, and the class limits are located on either side at a half 
unit's distance from this mid-value. All individuals, for in- 
stance, with height between 67.5 and 68.5 belong to class 68; here 
the limits are 67.5 and 68.5 and the class is designated by the 
number 68. Instead of using 61, 62, 63, etc., as class numbers, 
the classes may be simply numbered 1, 2, 3, etc., and these 
numbers used as class numbers. Again, the classes may be 
numbered in both ways from some point within the range, 
as 68. This would give class numbers as follows: -7, -6, 
-5, -4, -3, -2, -1, 0, +1, +2, etc. 

The objects measured or enumerated are referred to as 
variates or simply as individuals. 
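
The three ways of designating a class described above may be put in the form of a short sketch. The function names are invented for illustration, and the treatment of a height falling exactly on a class limit is an assumption, since the text does not say to which class such a measurement belongs.

```python
import math


def class_mid_value(height):
    """Mid-value of the one-inch class to which a height belongs.

    ASSUMPTION: a height falling exactly on a limit (67.5, say) is put in
    the upper class; the text does not say how such cases are assigned.
    """
    return math.floor(height + 0.5)


def class_limits(mid_value):
    """The limits lie half a unit on either side of the mid-value."""
    return (mid_value - 0.5, mid_value + 0.5)


def serial_number(mid_value, first_class=61):
    """Class numbers 1, 2, 3, etc., counted from the lowest class."""
    return mid_value - first_class + 1


def relative_number(mid_value, origin=68):
    """Class numbers counted in both directions from a chosen origin."""
    return mid_value - origin


print(class_mid_value(67.8), class_limits(68))
print([relative_number(m) for m in range(61, 75)])
```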



*Records of physical measurements at Ohio State University Gym- 
nasium, Freshman class, 1913. 

The size or frequency of a class is the number of indi- 
viduals within that class, and the total frequency is the sum 
of all the class frequencies. The table as a whole constitutes 
a frequency distribution of height, and shows the number of 
times each class occurs. 

To illustrate the method of constructing a frequency dis- 
tribution let us take the following data:* 

Chicago Monthly Top Beef Cattle Prices. 

Year.    Jan.    Feb.    Mar.    Apr.    May.   June.   July.    Aug.   Sept.    Oct.    Nov.    Dec. 
1916    $9.85   $9.75  $10.05  $10.00  $10.90  $11.50  $11.30  $11.50  $11.50  $11.60  $12.40  $13.00 
1915     9.70    9.50    9.15    8.90    9.65    9.95   10.40   10.50   10.50   10.60   10.55   11.60 
1914     9.50    9.75    9.75    9.55    9.60    9.45   10.00   10.90   11.05   11.00   11.00   11.40 
1913     9.50    9.25    9.30    9.25    9.10    9.20    9.20    9.25    9.50    9.75    9.85   10.25 
1912     8.75    9.00    8.85    9.00    9.40    9.60    9.85   10.65   11.00   11.05   11.00   11.25 
1911     7.10    7.05    7.35    7.10    6.50    6.75    7.35    8.20    8.25    9.00    9.25    9.35 
1910     8.40    8.10    8.85    8.65    8.75    8.85    8.60    8.50    8.50    8.00    7.75    7.55 
1909     7.50    7.15    7.40    7.15    7.30    7.50    7.65    8.00    8.50    9.10    9.25    9.50 
1908     6.40    6.25    7.50    7.40    7.40    8.40    8.25    7.90    7.85    7.65    8.00    8.00 
1907     7.30    7.25    6.90    6.75    6.50    7.10    7.50    7.60    7.35    7.45    7.25    6.35 
1906     6.50    6.40    6.35    6.35    6.20    6.10    6.50    6.85    6.95    7.30    7.40    7.90 
1905     6.35    6.45    6.35    7.00    6.85    6.35    6.25    6.50    6.50    6.40    6.75    7.00 
1904     5.90    6.00    5.80    5.80    5.90    6.70    6.65    6.40    6.55    7.00    7.30    7.65 
1903     6.85    6.15    5.75    5.80    5.65    5.15    5.65    6.10    6.15    6.00    5.85    6.00 
1902     7.75    7.35    7.40    7.50    7.70    8.50    8.85    9.00    8.85    8.75    7.40    7.75 
1901     6.15    6.00    6.25    6.00    6.10    6.55    6.40    6.40    6.60    6.90    7.25    8.00 
1900     6.60    6.10    6.05    6.00    5.85    5.90    5.85    6.20    6.15    6.00    6.00    7.50 
1899     6.30    6.25    5.90    5.85    5.75    5.75    6.00    6.65    6.90    7.00    7.15    8.25 
1898     5.50    5.85    5.80    5.50    5.50    5.35    5.65    5.75    5.85    5.90    6.25    6.25 
1897     5.50    5.40    5.65    5.50    5.45    5.30    5.25    5.50    6.00    5.40    6.00    5.65 
1896     5.00    4.75    4.75    4.75    4.55    4.65    4.60    5.00    5.30    5.30    5.45    6.50 
1895     5.80    5.80    6.60    6.40    5.25    6.00    6.00    6.00    6.00    5.60    5.00    5.50 



The width of the classes must be first determined. It 
would be possible to have a class for each quotation but it 
would be found highly inconvenient. The error introduced by 
the grouping of the measurements, the quotations in this case, 
is ordinarily of no practical significance. A general rule in 
determining the width of the classes, and hence of the number of 
classes, is to make as wide classes as is practically feasible; 
the number of classes is perhaps most often from ten to twenty. 
In this case the width is taken as fifty cents and the limiting 
quotations of each class are included in the class. 

The data is examined and a score made for each occurrence 
of the class. Thus Class I with the range $4.50 to $4.99 appears Feb., 
Mar., Apr., May, June, July, 1896; as an occurrence is observed 
a mark or score is made, six in all. After the scoring is com- 
pleted the frequency of each class, that is, the number of tallies 
or scores for each class, is noted and written in a column. 

*Yearbook, Chicago Live Stock World, 1917. 

The frequency distribution just obtained shows the number 
of times each price-class has occurred during the last twenty- 
one years. 
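
The scoring just described may be sketched in a few lines of Python. Only the twelve quotations for 1896 are used, expressed in cents so that the class limits are whole numbers; the names class_number and tally are invented for the illustration, and the lowest class is assumed to begin at 450 cents as in Class I above.

```python
from collections import Counter

# The twelve monthly quotations for 1896, in cents (Jan. to Dec.).
PRICES_1896 = [500, 475, 475, 475, 455, 465, 460, 500, 530, 530, 545, 650]


def class_number(price_cents, lowest=450, width=50):
    """Class I covers 450-499 cents, Class II 500-549 cents, and so on."""
    return (price_cents - lowest) // width + 1


def tally(prices, lowest=450, width=50):
    """Frequency of each class: one score for every quotation falling in it."""
    return Counter(class_number(p, lowest, width) for p in prices)


scores = tally(PRICES_1896)
for number in sorted(scores):
    # class number, the row of scores, and the resulting frequency
    print(number, "|" * scores[number], scores[number])
```

Applied to the whole table instead of a single year, the same tally gives the frequency of every price-class; a width of 25 in place of 50 gives the distribution called for in Exercise 1 below.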

Exercises. 

1. Construct a frequency table from the Top Beef Cattle Prices 
using class intervals of twenty-five cents and compare with the distribu- 
tion obtained when the class interval is fifty cents. 

2. From the following table of Mean Monthly Temperatures at the 
Columbus Station construct a frequency table with a class width of five 
degrees. 

Year.   Jan.   Feb.   Mar.   Apr.   May.  June.  July.   Aug.  Sept.   Oct.   Nov.   Dec. 
1878                                                78.6   74.0   65.9   53.5   43.1   26.2 
1879    25.6   28.9   41.6   50.3   64.1   71.4   78.4   71.2   61.4   82.4   43.8   37.2 
1880    43.8   38.8   40.8   53.8   68.8   72.8   75.1   74.2   65.4   52.2   32.8   24.8 
1881    24.2   29.2   36.8   47.6   67.8   69.7   78.6   75.8   74.8   60.5   44.4   40.6 
1882    32.6   41.8   44.8   51.0   57.4   69.9   71.8   72.2   66.2   59.0   42.6   32.0 
1883    27.1   34.3   35.3   51.0   60.3   70.7   74.1   70.2   64.0   55.6   44.7   34.8 
1884    20.0   37.2   39.2   49.4   61.7   72.9   73.8   72.8   71.0   58.9   41.4   32.2 
1885    22.9   19.4   30.1   50.0   61.2   69.0   76.6   71.1   64.0   61.4   40.9   32.4 
1886    23.4   26.8   39.0   54.7   62.8   67.3   72.2   71.6   65.8   54.4   38.8   27.2 
1887    26.8   36.4   37.3   50.8   67.6   71.2   79.6   70.8   64.0   51.4   41.7   32.8 
1888    26.8   33.0   36.5   51.2   60.6   71.6   73.2   71.4   61.3   48.7   44.1   34.2 
1889    34.2   26.4   42.2   61.8   61.4   67.7   74.1   70.2   63.8   49.0   41.2   44.8 
1890    39.1   40.6   35.2   52.3   60.0   74.6   73.6   70.2   63.1   53.8   44.6   31.8 
1891    33.0   36.8   34.8   52.9   57.6   72.4   70.0   71.0   69.4   52.8   40.3   40.0 
1892    24.0   35.7   36.0   49.4   61.0   74.2   74.0   73.0   66.4   53.6   38.2   30.0 
1893    18.8   30.8   39.5   51.6   59.4   71.4   76.4   71.8   66.7   55.1   40.0   33.0 
1894    34.7   29.4   46.2   51.5   60.6   72.4   75.2   72.2   69.1   54.8   39.0   34.6 
1895    24.1   21.0   36.7   53.2   62.8   74.9   73.8   75.6   71.2   48.2   42.4   34.9 
1896    30.8   31.8   33.7   58.8   69.7   70.8   74.4   72.9   63.2   50.4   45.2   36.4 
1897    26.4   34.0   43.1   50.6   57.9   69.8   77.2   71.1   68.8   59.8   42.7   33.7 
1898    33.2   31.5   46.3   48.6   62.7   73.8   77.7   75.0   69.8   55.0   39.6   30.0 
1899    29.4   22.8   38.3   55.2   65.4   73.4   76.2   75.8   65.7   59.2   45.3   31.1 
1900    32.8   26.8   34.5   51.8   65.4   71.9   76.2   78.5   71.2   62.2   42.4   32.5 
1901    30.4   23.5   40.6   48.5   61.1   73.4   79.9   74.6   66.6   55.7   38.6   28.7 
1902    28.8   23.2   42.7   50.1   65.0   68.2   75.2   70.6   65.2   56.2   49.8   30.7 
1903    28.3   31.8   47.8   51.2   65.8   66.0   74.5   73.0   67.6   55.4   38.4   24.8 
1904    22.8   24.8   40.9   45.0   62.1   69.6   73.4   71.0   67.1   54.0   42.4   29.0 
1905    24.0   22.0   44.4   50.4   62.3   70.8   74.9   73.3   67.1   53.7   40.6   34.2 
1906    36.6   28.4   32.0   54.7   62.8   71.0   73.2   75.7   70.0   53.0   42.4   32.7 
1907    33.5   27.2   46.8   43.0   55.7   67.2   73.9   71.0   66.7   50.0   40.2   34.6 
1908    30.0   29.6   44.5   51.7   64.0   70.6   75.6   72.7   70.4   55.6   42.8   34.3 
1909    32.8   36.2   38.1   50.1   59.8   71.8   72.0   73.4   64.1   49.9   49.6   25.8 
1910    ??.2   26.2   ?0.1   52.6   57.1   68.4   75.2   73.4   67.6   57.8   37.0   26.5 
1911    33.5   35.4   38.3   49.0   68.6   72.8   75.7   73.9   68.8   54.1   38.2   37.4 
1912    19.2   23.4   34.2   43.4   64.2   68.2   74.9   70.7   68.0   56.2   42.6   34.4 
1913    36.8   27.2   40.8   50.4   61.7   71.8   76.5   75.4   65.4   54.4   45.8   35.5 
1914    34.0   23.6   36.5   50.7   63.8   72.6   75.9   74.0   64.8   57.7   42.9   27.6 
1915    27.8   36.0   34.0   56.3   58.2   68.0   73.0   ??.2   68.0   56.7   44.4   31.0 
1916    36.0   26.8   35.5   49.9   62.6   65.9   78.6   75.8   63.9   55.2   43.4   30.4 

[? marks a digit illegible in the source.] 



3. Construct the frequency table of the following data of monthly 
precipitation at the Columbus Station. Take one-half inch for the class 
width. 

Year.   Jan.   Feb.   Mar.   Apr.   May.  June.  July.   Aug.  Sept.   Oct.   Nov.   Dec. 
1878    (entries illegible in the source) 
1879    1.66   1.43   3.77   0.92   2.09   2.68   3.67   4.64   2.33   ?.??   3.52   4.29 
1880    4.49   1.70   2.42   5.08   3.21   3.30   4.86   6.95   1.80   2.35   4.54   3.98 
1881    2.25   4.44   4.01   2.04   2.00   4.02   5.33   2.09   1.54   8.94   5.35   5.2? 
1882    4.69   5.94   4.76   4.87   9.?9   6.01   2.62   3.14   2.91   2.44   2.05   2.23 
1883    3.20   6.18   3.20   2.85   6.38   4.25   3.75   2.54   2.43   6.11   3.87   4.12 
1884    2.25   4.95   3.59   2.11   3.79   2.59   2.16   0.70   3.46   1.66   0.99   2.77 
1885    3.75   2.39   0.53   4.61   5.83   5.08   3.28   5.90   2.84   3.11   3.08   1.85 
1886    4.36   1.26   3.90   3.57   7.67   2.69   4.17   2.44   3.61   1.13   4.18   3.41 
1887    2.35   6.48   2.56   3.44   2.97   2.82   1.45   2.21   1.35   0.30   2.45   1.87 
1888    3.73   1.30   3.79   1.53   3.89   1.62   5.81   4.34   0.91   3.77   3.26   1.11 
1889    3.37   1.06   0.66   0.83   3.92   2.77   2.94   1.59   3.34   1.83   3.83   2.36 
1890    5.73   6.12   5.63   4.32   5.12   4.95   1.80   2.75   7.13   3.02   1.97   2.19 
1891    2.84   5.42   4.64   2.26   2.73   4.98   4.69   2.64   1.05   2.94   5.44   2.42 
1892    2.21   3.35   2.23   2.67   3.58   4.96   3.31   5.12   1.47   0.84   2.20   1.60 
1893    2.25   7.65   1.92   7.08   4.81   2.89   1.27   1.65   1.14   3.33   2.16   1.97 
1894    2.42   3.11   1.79   1.79   2.78   1.12   1.74   2.64   5.31   1.93   1.91   2.95 
1895    4.67   0.64   1.23   4.12   1.73   2.94   1.45   2.10   1.48   0.92   5.32   4.14 
1896    2.34   1.93   3.04   2.70   2.61   3.38   ?.47   3.53   5.93   0.55   3.53   1.52 
1897    1.54   3.71   5.45   4.27   3.68   2.45   6.95   1.95   0.82   0.36   7.54   2.43 
1898    5.29   1.67   7.03   2.05   6.04   1.63   2.33   7.16   1.77   2.95   2.30   1.09 
1899    2.35   1.44   4.69   1.18   2.25   1.26   4.85   1.49   2.01   2.23   1.72   2.98 
1900    3.01   3.30   2.59   1.76   1.82   2.45   3.89   3.02   0.97   2.86   3.71   0.?? 
1901    1.50   0.88   1.82   2.21   4.24   6.31   1.23   1.71   2.10   0.33   0.59   3.61 
1902    1.56   0.51   2.63   1.60   0.95   8.52   4.70   1.62   4.16   1.85   2.72   3.41 
1903    2.11   4.44   4.13   2.47   2.18   3.07   2.05   0.67   1.46   1.84   2.01   1.71 
1904    2.80   8.12   4.93   2.49   4.01   3.86   2.48   3.18   0.83   0.97   0.18   3.63 
1905    1.25   1.57   5.87   3.15   4.38   2.78   2.27   5.45   3.36   5.45   1.64   1.87 
1906    1.98   1.08   4.59   1.16   2.47   1.44   5.27   6.15   1.59   2.07   2.57   3.33 
1907    5.73   6.43   5.21   3.27   3.35   3.39   6.07   2.47   2.27   1.59   1.68   1.85 
1908    1.40   3.66   6.03   2.75   4.04   2.13   3.74   2.34   0.42   1.20   0.84   1.59 
1909    2.52   4.97   2.68   3.20   4.65   3.88   3.34   2.53   1.81   2.77   1.66   2.58 
1910    5.11   5.05   0.28   2.52   4.10   2.93   2.40   0.42   3.66   5.22   0.79   2.31 
1911    4.46   1.71   2.36   4.37   1.15   4.04   3.29   3.62   5.98   5.21   2.71   4.53 
1912    1.58   1.53   4.56   4.20   2.65   1.48   3.50   2.25   2.83   1.71   1.01   2.34 
1913    6.63   2.09   8.69   3.91   2.60   1.56   2.88   2.10   3.28   2.05   4.56   1.13 
1914    2.21   3.70   2.46   2.48   1.28   2.03   1.64   4.78   1.26   4.44   1.99   2.91 
1915    3.30   1.52   1.19   0.95   2.57   5.06   6.85   7.01   4.43   0.94   1.97   4.15 
1916    5.02   1.47   4.88   2.33   4.81   3.49   0.66   3.22   1.54   1.84   1.58   3.59 

[? marks a digit illegible in the source.] 

4. Study the frequency distributions of population with respect to 
age; Report of the Thirteenth Census, 1910, Chapter IV, Vol. 1, with 
special reference to the size of the various class intervals and note two 
general forms of stating the frequencies of the classes. 

5. Examine the different forms of frequency distributions appearing 
in the report of the Medico-Actuarial Society's Investigations, Vols. I, 
II, III, IV; also in Biometrika, Agricultural Experiment Station Bulletins 
and in other accessible sources. 

6. In which of the exercises of Chapter I is the data in the fre- 
quency distribution form? 

Plotting a Frequency Distribution. The illustrative data 
at the beginning of this chapter is plotted by locating 14 equidis- 
tant points on a horizontal line, one for each height class from 
61 inches to 74 inches inclusive. Then at the middle of each 
interval so obtained a vertical line is erected with a height pro- 
portional to the corresponding class frequency. In this way a 
point is obtained for each class. 

As in Chapter II, a rectangle is constructed on each in- 
terval. It must be apparent that a rectangle in the case of 
the frequency distribution has in every case a significant statis- 
tical meaning — it is the frequency of the class. Hence the sum 
of the areas of all the rectangles is the total frequency of the 
distribution. 
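
The statement that the rectangles together make up the total frequency can be checked directly from Table I. The sketch below is an illustration only; the function names and the choice of one mark for every five individuals in the rough text plot are assumptions.

```python
# Table I: height classes (inches) and their frequencies.
HEIGHTS = list(range(61, 75))
FREQUENCIES = [2, 10, 11, 38, 57, 93, 106, 126, 109, 87, 75, 23, 9, 4]


def rectangle_areas(frequencies, width=1):
    """Area of each rectangle: class width times class frequency."""
    return [width * f for f in frequencies]


def text_plot(classes, frequencies, per_mark=5):
    """One line per class; each mark stands for `per_mark` individuals."""
    return [f"{c:>3} {'#' * round(f / per_mark)} {f}"
            for c, f in zip(classes, frequencies)]


print("\n".join(text_plot(HEIGHTS, FREQUENCIES)))
print("total area:", sum(rectangle_areas(FREQUENCIES)))
```

With a unit base the area of each rectangle is simply the class frequency, so that the total area is the 750 of Table I.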

Smoothing the Frequency Curve. With the rectangles 
drawn, the smoothing of a frequency distribution is in no wise 
different from the smoothing of the data discussed in the preced- 
ing chapter. However, for the frequency curve the two rules 
of the permanence of areas have a stronger justification because 
of the more definite significance of the areas under the curve. 

With practice in the construction of statistical diagrams 
and curves the rectangles may be dispensed with and the 
curve drawn by inspection, especially when the data contains 
a large element of uncertainty. Also the broken line obtained 
by joining the ends of the ordinates, called the frequency poly- 
gon, may be smoothed by inspection into the required curve. 

Exercises. 

7. Smooth the illustrative data at the beginning of this chapter. 

8. Smooth the frequency distribution of Chicago Top Beef Cattle 
Prices for 50-cent intervals. 

9. Tabulate the same data to show the distribution for 25-cent 
classes. 

10. Construct the smoothed frequency curve for the distribution of 
temperature and of precipitation at Columbus since 1879. 

11. From data obtained from a financial paper construct the fre- 
quency distribution of the prices of preferred stocks for any one market 
day. 

12. Do the same for a common stock. 

13. Draw the smoothed curve of the following weight distribution of 
Ohio State University freshmen. 

Weight Class   102  107  112  117  122  127  132  137  142  147  152 
Frequency        8   13   20   48   76   93   93  110   93   49   56 

Weight Class   157  162  167  172  177  182  187 
Frequency       31   22   13   11    3    2    9 

The weight classes are here of width five pounds and the middle 
value of each class is taken as the class number. Class 187 includes all 
persons with weight greater than 184. 

14. Construct the smooth curve of the distribution of ages of grad- 
uates from the Columbus Public Schools. 

Ages: 11 12 13 14 15 16 17 18 
Numbers: 7 45 186 114 61 8 

13. Construct the frequency curve of the preferred stock data of 
Exercise 11. 

14. Do the same for the common stock data of Exercise 12. 

Use of the Frequency Curve. The frequency curve does 
not give a chronological picture of the variations in the data. 
Instead it shows the number of times that each value occurs. 
The frequency curve of precipitation for a dryer climate is 
located to the left of that for a more moist climate because 
months with small precipitation occur more frequently in the 
dryer region. The frequency curve of higher prices lies further 
to the right than does that of lower prices, so that by con- 
structing the frequency curves it can be readily discovered which 
series of prices tends to be higher. 

Exercises. 

15. Compare the Top Beef Cattle Prices of 1895 with those of 
1915. 

16. Compare the precipitation at Columbus with that of some other 
station. 

Typical or Representative Data. Statistical data may 
be collected for the express purpose of exhibiting a chronological 
or other statement of the variations. This sort of data is usually 
based on the complete enumeration of a given set of objects, as 
the census of population to apportion the members of the House 
of Representatives, or measures of stature for military purposes. 

In discussing an increase in prices it is impossible to quote 
all prices; recourse must be had to a carefully selected list of 
prices. The condition of trade in certain industries is taken as 
indicative of the condition of all business. In comparing the 
prices of beef and the prices of corn the real object of investiga- 
tion is to find an underlying connection between the two series 
of values — a connection which will hold good in any particular 
year. In such a study the historical statistics of the two price 
variations are in reality used as representative, as typical, of 
the manner in which the two prices are related. It is apparent 
that the frequency form of distribution is peculiarly adapted to 
typical data. 

The Errors in Representative Data. The theory of 
enumerative statistics is simple in statement; the chief cares 
of the statistician are that all objects are counted and none 
counted more than once, and that an adequate and effective 
method of presentation is adopted. There are also complicated 
questions of the methods of collecting the data and of the limits 
of accuracy of the data but these are met with in data of either 
form. 

Because it is practically impossible to secure homo- 
geneous data; that is, data in which the values for all char- 
acteristics except those under consideration are the same for 
all variates, representative data must be examined for homo- 
geneity. For instance, the persons whose heights are tabulated 
at the beginning of this chapter differ in age, early environ- 
ment, physical condition, as well as in height so that the given 
distribution is in reality a distribution of a complex of attributes 
instead of merely the one attribute, height. Unless the influence 
of these various factors is carefully studied, serious errors may 
result from attempting to apply to another distribution the con- 
clusions drawn from this distribution. 

It may be shown that successive samples made up strictly at 
random, that is, without bias or prejudice, will most likely give 
materially differing distributions even when drawn from absolutely 
homogeneous material. The extent of such errors must be understood in 
reasoning from one distribution to another. 

Hence in working with typical or representative data care 
must be taken regarding (1) the limits of accuracy of the data; 
(2) the homogeneity of the data; (3) the errors of random 
sampling. 



CHAPTER IV. 

The Arithmetic Mean. Let us add the January prices 
in the data of page 25, and then divide the sum by the number 
of items. The result is $7.19. In this way a number, the arith- 
metic mean, is obtained. The characteristic arithmetic property 
of this number is that each of the given data values may be 
replaced by it without altering the total sum of all the values. 
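
The property just stated is easily verified by computation. As a small sketch, the lines below take the twelve monthly prices for 1916 from the table of Chapter III, expressed in cents, and keep the mean as an exact fraction so that no rounding enters; the function name is invented for the illustration.

```python
from fractions import Fraction

# Monthly Top Beef Cattle prices for 1916, in cents (Jan. to Dec.).
PRICES_1916 = [985, 975, 1005, 1000, 1090, 1150,
               1130, 1150, 1150, 1160, 1240, 1300]


def arithmetic_mean(values):
    """Sum of the values divided by their number, kept as an exact fraction."""
    return Fraction(sum(values), len(values))


mean = arithmetic_mean(PRICES_1916)
replaced = [mean] * len(PRICES_1916)      # every value replaced by the mean
print(float(mean) / 100)                  # the mean price in dollars
print(sum(replaced) == sum(PRICES_1916))  # the total sum is unaltered
```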

It is usual to speak of the arithmetic mean simply as the 
mean unless, in order to distinguish the arithmetic mean from 
some other mean, there is special need for the defining word 
"arithmetic." 

Exercises. 

1. Determine from the data of Exercise 1, Chapter I, the arithmetic 
mean of the monthly rainfall at Columbus, for March. 

2. Determine from the data of Exercise 5, Chapter I, the arithmetic 
mean of the annual precipitation at Columbus. 

3. Find from the data of page 25 the arithmetic mean of the 1895 
Top Beef Cattle prices and compare with the 1915 mean. 

4. On the assumption that the population of the United States 
increased uniformly from 1900 to 1910 find the value of the annual 
increase and then the estimated population for 1906. 

5. Compute the arithmetic mean of the Monthly Top Beef Cattle 
prices for the years 1895 to 1916. 

6. By first assigning each monthly price to the appropriate 50-cent 
class as on page 25 and computing the arithmetic mean of the prices 
when so altered determine the effect on the value of the arithmetic 
mean of substituting the class prices for the exact values. Use the 
class numbers in the computation and translate the result in terms of 
the proper interval. 

7. In Exercise 5 there are 264 entries in the sum to be added. 
Show that much of the labor of the addition can be avoided by selecting 
the equal prices, then multiplying each by the number of times it occurs, 
and adding the resulting products to obtain the total sum of prices. 

The results of Exercises 5, 6 and 7 suggest the computing 
of the mean from a frequency table in accordance with the 
following rule: multiply each deviation by its frequency, add the 
resulting products, and divide this total sum by the total fre- 
quency. The quotient is the value of the mean. Thus, from the 
frequency distribution of Top Beef Cattle Prices of Chapter III, 
obtained on page 26, namely 6, 13, 35, 43, 30, 30, 18, 12, 15, 16, 
17, 3, 6, 7, 1, the mean price is given by the expression 

\[
d = \frac{1\times 6 + 2\times 13 + 3\times 35 + 4\times 43 + 5\times 30 + 6\times 30 + 7\times 18 + 8\times 12 + 9\times 15 + 10\times 16 + 11\times 17 + 12\times 3 + 13\times 6 + 14\times 7 + 15\times 1}{252} = \frac{1570}{252} = 6.23,
\]

where d is the distance of the computed mean from the 
origin. 

The mean class is thus 6.23; that is, 6.23 of the 50-cent in- 
tervals from the origin. Since 7.25 is the mid-value of class 6, 
the mean price is 7.25 + 0.23 x 0.50, or about 7.37. 

Whenever the frequency table is available the method just 
described is usually the shortest method for computing the value 
of the mean. However if the frequency distribution is not 
needed for any other purpose and especially if an adding ma- 
chine is at hand the saving of time in the computation of the 
mean does not ordinarily justify the compilation of a frequency 
table merely for the one purpose of finding the mean. 

The following is the computation for mean height from 
the data at the beginning of Chapter III. 

Let us take the origin at height 60. Then the computation 
scheme will be as follows : 



Computation of the Mean. 

Class.   Deviation.   Frequency.   Dev. times Freq. 
  61          1             2                 2 
  62          2            10                20 
  63          3            11                33 
  64          4            38               152 
  65          5            57               285 
  66          6            93               558 
  67          7           106               742 
  68          8           126             1,008 
  69          9           109               981 
  70         10            87               870 
  71         11            75               825 
  72         12            23               276 
  73         13             9               117 
  74         14             4                56 
Total                     750             5,925 

5,925 / 750 = 7.9 
Table II. 

Hence the mean height is 7.9 classes; that is, 7.9 inches from 
the origin, and is therefore equal to 67.9 inches. 
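
The computation scheme of Table II may be sketched in Python as follows. The function name and the use of an exact fraction are choices made for this illustration; the classes, the frequencies and the origin at height 60 are those of the table.

```python
from fractions import Fraction

# Table II: height classes (inches) and their frequencies.
CLASSES = list(range(61, 75))
FREQUENCIES = [2, 10, 11, 38, 57, 93, 106, 126, 109, 87, 75, 23, 9, 4]


def mean_from_frequency_table(classes, frequencies, origin=60):
    """Multiply each deviation by its frequency, add the products, divide
    by the total frequency, and measure the quotient from the origin."""
    deviations = [c - origin for c in classes]
    total = sum(d * f for d, f in zip(deviations, frequencies))
    return origin + Fraction(total, sum(frequencies))


mean_height = mean_from_frequency_table(CLASSES, FREQUENCIES)
print(float(mean_height))   # mean height in inches
```

Any other origin gives the same mean, since a change of origin alters every deviation by the same amount; an origin near the middle of the range merely keeps the products small.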

Statistical Properties of the Arithmetic Mean. What is 
the statistical significance and interpretation of the arithmetic 
mean? If a higher price were substituted for one of the January 
beef cattle prices the resulting arithmetic mean would be larger, 
but not so much larger as the individual price because in the 
process of obtaining the mean the price increase is divided by 
the total number of prices. Hence a larger mean denotes that, 
as a whole, the values of the distribution are greater, and a 
smaller arithmetic mean is to be interpreted as indicating a 
relatively lower series of values. And since all increases and 
decreases are to be divided by the number of variates the 
changes in the value of the arithmetic mean are relatively 
smaller than those of the individual values. Thus a decrease 
of 50 cents must occur in each of the above prices in order to 
decrease the arithmetic mean by the same amount. A decrease 
of 50 cents in one-half the variates decreases the arithmetic mean 
by only 25 cents, and so on. That is, the arithmetic mean is 
relatively more stable than is an individual measurement. Thus 
if several groups of 750 students were measured for height and 
the frequency distribution tabulated and the means computed 
for each group it would be found that the means would differ 
but little while the frequency of any one class, 67 inches for 
instance, would vary considerably from distribution to distribu- 
tion. 

It is to be noted that a single increase of 50 cents in the 
price for one month has exactly the same effect on the value of 
the arithmetic mean as does a 10-cent increase in the prices of 
each of five months. But is this true statistically? Should the 
exceptionally high price be given so much weight? Should the 
person of exceptional height be emphasized so strongly in the 
group of persons whose height is measured? 

That is, the value of the mean may not always be significant 
because a part of its value may be due to the presence of un- 
duly large variates. Whether an item is unduly large can be 
determined only from a study of the data itself for the mean 
conveys no information whatever as to the distribution of the 
variates; it tells only of their general size. That is, the statistical 
function of the arithmetic mean is essentially to measure the 
size or magnitude of the data as a whole. 

Theorem. In any distribution the sum of the deviations 
from the mean is zero. That is, the sum of the positive devia- 
tions is equal to the sum of the negative deviations. The 
distance of the mean from any origin is obtained by taking 
the sum of the deviations from that origin and dividing by 
the total frequency, hence when this distance is zero the sum 
of the deviations must be zero. 
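
The theorem can be checked numerically on the student height data. The sketch below is a verification on one distribution, not a proof; the variable names are illustrative.

```python
# Student height distribution (inches and frequencies) from Table II.
heights = list(range(61, 75))
freqs = [2, 10, 11, 38, 57, 93, 106, 126, 109, 87, 75, 23, 9, 4]

total = sum(freqs)
mean = sum(h * f for h, f in zip(heights, freqs)) / total

# Sum of the deviations above the mean and of those below it, each taken positive.
positive = sum((h - mean) * f for h, f in zip(heights, freqs) if h > mean)
negative = sum((mean - h) * f for h, f in zip(heights, freqs) if h < mean)

# The two sums agree up to floating-point rounding, so their difference is zero.
print(round(positive, 4), round(negative, 4))
```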

Weighted Arithmetic Mean. An apparent modification 
of the arithmetic mean is illustrated by the following: It is 
desired to obtain an index of food prices by taking the mean 
of the price quotations of 15 articles of food. It is decided, 
however, that one of the quotations should be given twice 
the weight of the other articles. This is done by multiplying 
this quotation by two and taking the double quotation in the 
total sum. The article is said to have a weight of two. The 
idea of weight introduces no new principles into the computation 
of the arithmetic mean. 
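
A short sketch of the weighted mean follows. The fifteen quotations are hypothetical figures invented for the illustration, and the division by the sum of the weights (16 here) is the usual convention, which the text does not state explicitly.

```python
def weighted_mean(values, weights):
    """Arithmetic mean in which each value is counted weight times."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Hypothetical price quotations (in cents) of 15 articles of food.
quotes = [32, 18, 45, 27, 60, 12, 38, 25, 41, 19, 33, 52, 16, 29, 36]
# The first article is given a weight of two, all others a weight of one.
weights = [2] + [1] * 14

index = weighted_mean(quotes, weights)

# Same result as entering the first quotation twice in a plain mean.
plain = sum(quotes + [quotes[0]]) / 16
print(index, plain)
```

The agreement of the two figures shows in what sense the idea of weight introduces no new principle.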

Adjustment or Graduation Formulas. A class of adjust- 
ment formulas of wide and convenient adaptability to the 
smoothing of data are based on the arithmetic mean. 

A series of terms not differing greatly from each other may 
be smoothed by replacing each by the mean of the five terms, 
for instance, of which the given term is the middle term. The 
distribution obtained by the first adjustment may in turn be 
similarly smoothed, and indeed the process may be repeated at 
pleasure. In this way the various graduation formulas of this 
type are built up. Next to the graphic method this is the simplest 
method for the smoothing of observations. 
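
The five-term adjustment may be sketched as follows. The series is hypothetical, and the treatment of the end terms (they are simply dropped, since no full group of five is centered on them) is an assumption of this sketch rather than a rule given in the text.

```python
def moving_mean(series, width=5):
    """Replace each term by the mean of the `width` terms centered on it.

    Only terms having a full group of neighbors are returned, so the
    result is shorter than the input by width - 1 terms.
    """
    if width % 2 == 0:
        raise ValueError("width must be odd so that a middle term exists")
    half = width // 2
    return [
        sum(series[i - half : i + half + 1]) / width
        for i in range(half, len(series) - half)
    ]


# Hypothetical series of terms not differing greatly from each other.
series = [12, 15, 11, 16, 14, 18, 15, 19, 17, 21, 18, 22]

once = moving_mean(series, 5)   # first adjustment
twice = moving_mean(once, 5)    # the process repeated on the adjusted series
print(once)
print(twice)
```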

Extensive application of this method has been made in 
the graduating of mortality tables, and under the name of the 
method of the moving average it is often used in smoothing 
data in which the general trend is obscured by the presence of 
more or less regular fluctuations. In this case the number of 
classes grouped together should be determined by the lengths of 
the cycles of the fluctuations.* If the cycles are irregular in 
length the method of the moving average is not likely to yield 
satisfactory results. 

* See King, Elements of Statistical Method, sec. 97. Also Quarterly 
Publications of the American Statistical Society, Dec., 1915, and March, 
1916. 

Exercises. 

8. Smooth the data of Table 1, by taking the means of each suc- 
cessive five terms, then of seven and finally of nine. 

9. Apply the method of the moving average to smooth the data 
of the Top Beef Cattle Prices. Is the method highly applicable to this 
data? 

10. Discuss the reliability of this method for terms at the end of 
the range. 

11. Apply the five term method to the distribution of Ex. 5, Chap. I. 

The Geometric Mean. Let the price of a certain article 
for each year from 1910 to 1915 be expressed as a percent of 
that of the preceding year as follows (assuming 100 for the 1910 
price), 100, 105, 118, 109, 102, 115. The percent increase 
from 1910 to 1915 is obtained by multiplying together the five 
percents and is approximately 1.58. What uniform percent of 
increase will give the same percent of increase of 1915 over 
1910? Let $(1 + r)$ be the constant multiplier or percent. Then 
we have 

$$(1 + r)^5 = 1.05 \times 1.18 \times 1.09 \times 1.02 \times 1.15 = 1.58415,$$

and 

$$(1 + r) = \sqrt[5]{1.58415} = 1.096.$$

Each of the unequal increases in the series may therefore 
be replaced by the percent, 1.096, and still give the same product. 

The population of continental United States in 1910 was 
91,972,266; in 1900, 75,994,575. On the assumption of a uni- 
form rate of increase during the decade what should be the 
value of this uniform rate in percent? As above, we have 

$$(1 + r)^{10} = 91,972,266 / 75,994,575 = 1.21025.$$

Hence 

$$(1 + r) = \sqrt[10]{1.21025} = 1.019.$$

It may be noted that according to this method the population 
in 1906 was equal to $75,994,575 \times (1.019)^6$. 

The problem in the case of the arithmetic mean is to 
find a uniform number which, when substituted for each of 
the variates, leaves the total sum unchanged. In problems 
similar to that just preceding it is a matter of finding a number 
which, when substituted for each of the given numbers, 
leaves the product of all the numbers unchanged; such a 
number is called the geometric mean. 
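
The two problems above may be reworked as a brief sketch. The function name `geometric_mean` is chosen here for the illustration; the yearly multipliers and the census figures are those of the text.

```python
import math


def geometric_mean(factors):
    """The number which, put in place of each factor, leaves the product unchanged."""
    return math.prod(factors) ** (1 / len(factors))


# Yearly price multipliers for 1911 to 1915 (105, 118, 109, 102, 115 percent).
ratios = [1.05, 1.18, 1.09, 1.02, 1.15]
uniform_multiplier = geometric_mean(ratios)

# Uniform yearly multiplier for the population, 1900 to 1910.
pop_1900 = 75_994_575
pop_1910 = 91_972_266
yearly = (pop_1910 / pop_1900) ** (1 / 10)

# Population in 1906 on the assumption of a uniform rate.
pop_1906 = pop_1900 * yearly ** 6

print(round(uniform_multiplier, 3), round(yearly, 3), round(pop_1906))
```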

Exercises. 

12. Compute the geometric mean of the following numbers : 2, 4, 8. 

13. Compare from exercise 4, the 1906 population on the assump- 
tion of a uniform annual increase with that obtained from the assump- 
tion of a uniform annual rate. 

For any but the simplest problems the computation of the 
geometric mean cannot be accomplished without the use of 
logarithms. The following computation of the geometric mean 
of student heights from the data of page 24 illustrates the process. 

The geometric mean height 

$$G = \left(61^{2} \cdot 62^{10} \cdot 63^{11} \cdot 64^{38} \cdot 65^{57} \cdot 66^{93} \cdot 67^{106} \cdot 68^{126} \cdot 69^{109} \cdot 70^{87} \cdot 71^{75} \cdot 72^{23} \cdot 73^{9} \cdot 74^{4}\right)^{1/750}$$

and 

$$750 \log G = 2 \log 61 + 10 \log 62 + 11 \log 63 + 38 \log 64 + 57 \log 65 + 93 \log 66 + 106 \log 67$$
$$\qquad + 126 \log 68 + 109 \log 69 + 87 \log 70 + 75 \log 71 + 23 \log 72 + 9 \log 73 + 4 \log 74.$$

Hence 

$$\log G = 1373.70355 / 750 = 1.83160$$

and geo. mean height = 67.86. 
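
The logarithmic computation may be sketched as follows, with the machine taking the place of the table of logarithms. Common (base 10) logarithms are used to match the figures above; the variable names are illustrative.

```python
import math

# Student height distribution (inches and frequencies).
heights = list(range(61, 75))
freqs = [2, 10, 11, 38, 57, 93, 106, 126, 109, 87, 75, 23, 9, 4]

total = sum(freqs)

# Sum of frequency times logarithm of height, i.e. 750 log G.
sum_f_log = sum(f * math.log10(h) for h, f in zip(heights, freqs))

log_g = sum_f_log / total   # logarithm of the geometric mean
geo_mean = 10 ** log_g      # the geometric mean itself

print(round(sum_f_log, 3), round(log_g, 5), round(geo_mean, 2))
```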

Exercises. 

14. Compute the geometric mean of the Monthly Top Beef Cattle 
Prices. 

15. Compute the geometric mean for the March precipitation at 
Columbus for the years since 1878. 

Properties of the geometric mean. Unlike the arithmetic 
mean the geometric mean is most powerfully affected by the 
smaller deviations because a small factor in a product has a 
proportionately greater influence on the result of the multiplica- 
tion than does a larger factor. 

Each property of the arithmetic mean has a corresponding 
property for the geometric mean because the logarithm of the 
geometric mean is the arithmetic mean of the logarithms of the 
deviations. From this logarithmic correspondence all the properties 
of the geometric mean can be derived* from those of the 
arithmetic mean. It is apparent, for instance, that the geometric 
mean applies to a series of deviations multiplied together in a way 
exactly parallel to that in which the arithmetic mean applies to a 
series of terms to be added. Other parallels are: a chain of relative 
prices and a series of price increases; interpolation on the assumption 
of a uniform rate and of a uniform increase; compound interest 
and simple interest. 

The Median. Let the years 1879 to 1916 inclusive be 
arranged in order of the March precipitation beginning with the 
lowest. We then have with the data† measured to hundredths 
of an inch : 

1910 0.28 

1885 0.53 

1889 0.66 

1915 1.19 

1895 1.23 

1894 1.79 

1901 1.82 

1905 1.87 

1893 1.92 

1892 2.23 

1911 2.36 

1880 2.42 

1914 2.46 

1887 2.56 

1900 2.59 

1902 2.63 

1909 2.68 

1896 3.04 

1883 3.20 

1884 3.59 

1879 3.77 

1888 3.79 

1886 3.90 

1881 4.01 

1903 4.13 

1912 4.56 

1906 4.59 

1891 4.64 

1899 4.69 

1882 4.76 

1904 4.93 

1907 5.21 

1897 5.45 

1890 5.63 

1908 6.03 

1898 7.03 

1913 8.09. 



* See Zizek, "Statistical Averages," Chapter III. Also Jevons, "On 
the Variation of Prices and the Value of the Currency since 1782," 
Jour. Roy. Stat. Soc., Vol. XXVIII, 1865. Galton, "The Geometric Mean 
in Vital and Social Statistics," Proc. Roy. Soc., Vol. XXIX, 1897, p. 365. 
McAlister, "The Law of the Geometrical Mean," the same, p. 367. Yule, 
"An Introduction to Statistics," p. 123. 

† U. S. Weather Bureau Report, Columbus Station, 1917. 

The middle year, 1883, in this ordered arrangement is 
called the median year with respect to March precipitation; 
the median precipitation of 3.30 inches, being that of the median 
year. 

In general the median individual is defined as the indi- 
vidual so located that there are as many individuals with a 
greater value of the characteristic as with a less value; and the 
middle value of the measured characteristic is spoken of as 
the median value of the characteristic. 

If the number of variates is even the median is assumed to 
lie between the two middlemost variates. 
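
The definition may be sketched as a small function. The sample figures are hypothetical, and placing the median midway between the two middlemost variates when the number is even is a common convention assumed here; the text says only that it lies between them.

```python
def median(values):
    """Middle value of the variates when arranged in order of size."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd number of variates: as many lie above the middle one as below it.
        return ordered[mid]
    # Even number: taken midway between the two middlemost (an assumption).
    return (ordered[mid - 1] + ordered[mid]) / 2


# Hypothetical precipitation figures, in inches.
odd_sample = [2.36, 0.53, 4.01, 3.20, 1.79]
even_sample = [2.36, 0.53, 4.01, 3.20, 1.79, 5.21]

print(median(odd_sample), median(even_sample))
```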

It is obvious that the above median precipitation year might 
have been obtained by a simple process of counting and inspection 
of the data without the somewhat laborious process of arrang- 
ing the variates in order. 

Exercises. 

16. From the data of Exercise 2, Chapter III, determine the 
median Columbus monthly temperature, and the median year in respect 
to temperature. 

17. From the price data of page 25 determine the median top beef 
cattle price. 

18. From the data of Exercise 19, Chapter I, determine the median 
price for wheat. 

When the data is in the form of a frequency distribution the 
computation of the position of the median is much facilitated. 
All that is necessary then is to start from one extremity of the 
distribution and include successive classes until half the total 
frequency is obtained. The only point of difficulty in this case 
is when the median is located within a class. Then it is necessary 
to interpolate within the median class for the more exact position 
of the median. To illustrate the method of interpolation let us 
find the median student height from the data at the beginning of 
Chapter III. Half of the number of variates is 375. Counting 
from the lower extremity we find, up to and including class 67, a 
frequency of 317, so that it is necessary to take 58 individuals 
from class 68. Hence we may assume that the position of the 
median will be 58/126 of a unit from the left boundary of class 
68. Since this boundary is at 67.5 the median is located at 
67.96 inches. 
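
The counting and interpolation just described may be sketched as follows. The function name `grouped_median` and the `width` parameter are illustrative; each class is assumed to extend half a class interval on either side of its mid-value, as in the text.

```python
# Student height distribution (class mid-values in inches, and frequencies).
heights = list(range(61, 75))
freqs = [2, 10, 11, 38, 57, 93, 106, 126, 109, 87, 75, 23, 9, 4]


def grouped_median(values, freqs, width=1.0):
    """Median of a frequency distribution, interpolated within the median class."""
    half = sum(freqs) / 2
    cumulative = 0  # frequency counted so far from the lower extremity
    for v, f in zip(values, freqs):
        if f > 0 and cumulative + f >= half:
            left_boundary = v - width / 2
            needed = half - cumulative  # individuals still to be taken from this class
            return left_boundary + needed / f * width
        cumulative += f
    raise ValueError("the distribution has no frequencies")


median_height = grouped_median(heights, freqs)
print(round(median_height, 2))
```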

Geometrically, the median deviation locates the ordinate 
which divides the area under the frequency curve into two equal 
parts. 

Exercises. 

21. What is the median point of population as determined by the 
Bureau of the Census (see pp. 60-52, Vol. I., Report of the 13th Census) ? 

22. Distinguish the median point of population from the center of 
population. 

Quartiles. Each half of the distribution, one on either 
side of the median, may be divided into two equal parts. These 
two points of division are the First and Third Quartiles. 

The two quartiles and the median thus divide the variates 
into four classes of equal frequencies. 

In data having predominately large frequencies near the cen- 
ter of the distribution the quartiles are relatively close to the 
median, and in widely scattered data the quartiles are relatively 
far from the median. This property of the quartiles is developed 
and applied in the next chapter. 

The median can be found directly from the cumulative curve 
by drawing a horizontal line thru the point on the vertical scale 
corresponding to half the total frequency. The abscissa of the 
point of crossing of this horizontal line and the curve is the me- 
dian deviation. 

Exercises. 

19. By drawing the cumulative curve locate the median student 
height. 

20. From the frequency distribution of top beef cattle prices of 
page 25 determine the median price by using the cumulative curve. 

Deciles. The decile variates are the variates which 
separate the frequency into ten equal classes. The median is of 
course the fifth decile but the quartiles are not deciles. The chief 
use of the deciles, like that of the quartiles, is in determining the 
shape of the distribution. 

Exercises. 

23. Determine the quartile precipitations from the data of Ex- 
ercise 5, Chapter I. 

24. Determine the decile precipitations from the data of Exercise 3, 
Chapter II. 

25. Determine the quartile and the decile temperatures from the 
data of Exercise 2, Chapter III. 

26. Determine the quartile prices from the top beef cattle prices 
of page 25. 

27. Determine the quartile top beef cattle prices from the data in 
the form of a frequency distribution of the data of page 25. 

In this problem the quartile prices must be obtained by a process of 
interpolation similar to that described for the median. 

Statistical Properties of the Median. The value of the 
median ordinate depends not on the actual values of the variates 
but solely on the relative values. The data need be given with 
only enough exactness to permit the arrangement of the variates 
in order with respect to the attribute considered. Moreover, it 
is only the arrangement near the median value that must be care- 
fully attended to, consequently the median can not give detailed 
information of the variates at the extremities of the ranges. 

There is apparently no a priori reason why the value of the 
median should not show considerable variation from sample to 
sample taken from the same material, but in practice it is found 
that the median shows as high a degree of stability as, if not a 
higher degree than, the arithmetic mean. Thus if a second group of 750 
students were measured as to height and the median computed 
it would most likely be found to differ only slightly from that of 
the group already discussed. This slowness of change in the 
median means that the median is not greatly affected by the 
presence of accidental and irrelevant influences. That is, dif- 
ferences in the value of the median are not likely to be merely 
accidental and hence the median measures significant properties of 
the material. For instance, a distribution of wages showing a 
higher median wage must be significantly a group of higher 
wages. 

The properties just discussed together with the fact that the 
median can be located by the simple process of counting render 
the median a highly important average in practical statistical 
work. 

The Probable Deviation. The median variate divides 
the data into two classes of equal frequencies. Hence it is an even 
chance that an individual selected at random will fall into a desig- 
nated one of the two classes. If the median height of freshmen 
students is 68 inches it is an even bet that a student concerning 
whose height nothing is known has a height less than 68 inches. 

Likewise it is an even bet that a student selected at 
random will have a height between the first and third quartiles. 
The range from the median to the third or first quartile, 
one-half of the range within which the chances are even for 
an individual measurement to lie, is called the probable 
deviation.* 
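
As a sketch, the quartiles of the student heights may be located by the same interpolation used for the median, and the probable deviation taken as one-half the range between them. The function `grouped_quantile` is an illustrative helper, not a routine from the text; taking half the interquartile range assumes the two quartiles are about equally distant from the median.

```python
# Student height distribution (class mid-values in inches, and frequencies).
heights = list(range(61, 75))
freqs = [2, 10, 11, 38, 57, 93, 106, 126, 109, 87, 75, 23, 9, 4]


def grouped_quantile(values, freqs, p, width=1.0):
    """Point below which the fraction p of the total frequency lies."""
    target = p * sum(freqs)
    cumulative = 0
    for v, f in zip(values, freqs):
        if f > 0 and cumulative + f >= target:
            # Interpolate within the class that contains the target count.
            return (v - width / 2) + (target - cumulative) / f * width
        cumulative += f
    raise ValueError("p must lie between 0 and 1")


first_quartile = grouped_quantile(heights, freqs, 0.25)
third_quartile = grouped_quantile(heights, freqs, 0.75)

# One-half of the range within which the chances are even.
probable_deviation = (third_quartile - first_quartile) / 2

print(round(first_quartile, 2), round(third_quartile, 2),
      round(probable_deviation, 2))
```

The same helper gives the median with `p = 0.5` and the deciles with `p = 0.1, 0.2, ...`.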

Exercises. 

28. Determine the probable deviation for top beef cattle prices. 

29. Determine the probable deviation for monthly precipitation at 
Columbus ; for monthly temperatures at the same station. 

30. Show that the probable deviation is necessarily connected with 
the frequency distribution and not with a chronological distribution. 


The Mode. Notice that, in the frequency distribution of 
student heights, class 68 has the greatest frequency and that the 
high point on the frequency curve is within the same class. 
The class of greatest frequency is called the modal class and the 
deviation with the highest ordinate the modal deviation. A 
mode is thus defined as a class or deviation of greatest fre- 
quency; more accurately, it is the class or deviation of greater 
frequency than that of either the class immediately greater or 
immediately less. This second definition allows for distributions 
having more than one mode. 
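
The second definition lends itself to a simple sketch: a class is a mode when its frequency exceeds that of both neighboring classes. The function name `modes` is illustrative, and treating the classes beyond the ends of the table as having zero frequency is an assumption of this sketch.

```python
def modes(values, freqs):
    """Classes whose frequency exceeds that of the class on either side."""
    found = []
    for i, f in enumerate(freqs):
        # Classes beyond the ends of the table are taken to have frequency 0.
        left = freqs[i - 1] if i > 0 else 0
        right = freqs[i + 1] if i < len(freqs) - 1 else 0
        if f > left and f > right:
            found.append(values[i])
    return found


# Student height distribution: a single mode.
heights = list(range(61, 75))
freqs = [2, 10, 11, 38, 57, 93, 106, 126, 109, 87, 75, 23, 9, 4]
print(modes(heights, freqs))

# A hypothetical distribution with two modes.
print(modes([1, 2, 3, 4, 5], [1, 5, 2, 6, 1]))
```

Applied to unsmoothed data the rule will also report small accidental peaks, so that judgment is still required as to which apparent modes are significant.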

Exercises. 

31. From the smoothed frequency curve of the data of page 27 
determine the modal monthly precipitation. 

32. Determine the modal March temperature for Columbus. 

It is possible to locate the mode within a class by a process 
of interpolation similar to that described in the determination of 
the median but by far the easiest method is to construct the 
smooth frequency curve and determine the abscissa or deviation 
of the greatest ordinate. 

When the data seems to have more than one mode care 
must be exercised in deciding whether to smooth out the 
apparent modes. In the frequency distribution of monthly 
temperatures it is evident that there are summer and winter 
modal temperatures. The telephone-calls data of Exercise 33 
below shows more than one mode. On the other hand the data 
of age distribution reported by the United States Census Bureau 
shows a tendency for the frequencies at the even ages to be 
larger than at the odd ages. This latter tendency is partly due to 
the fact that persons who are uncertain as to their exact age seem 
to show a preference for an even number. These apparent modes 
should be smoothed out. Data with essentially one mode is said 
to be unimodal; with more than one mode, multimodal. 

* Certain qualifications of this definition are discussed in Chapter V. 









Exercises. 

33. Smooth the following data of the telephone calls for one day 
at a business exchange* and locate the modes. 

Time ....    6-7    7-8    8-9   9-10  10-11  11-12   12-1    1-2    2-3
Calls ....  1595   3430   6389   6904   7282   7358   6361   5659   6186

Time ....    3-4    4-5    5-6    6-7    7-8    8-9   9-10  10-11  11-12
Calls ....  6597   6510   6093   4508   4210   2289   1197    916    314

Time ....  11-12
Calls ....    12

34. Do the same for the following residence calls.** 

Time ....    6-7    7-8    8-9   9-10  10-11  11-12   12-1    1-2    2-3
Calls ....  1256   3796   6604   4098   4240   3816   5852   4421   3136

Time ....    3-4    4-5    5-6    6-7    7-8    8-9   9-10  10-11  11-12
Calls ....  4344   3267   4541   4778   4039   2088   1176    655    187



35. Determine the modal classes for the top beef cattle prices. 

Statistical Properties of the Mode. Because the neces- 
sary modifications are easily made for multimodal data the prop- 
erties of the mode are here discussed only for a unimodal dis- 
tribution. 

Since the modal class or deviation is that of greatest frequency; 
that is, since more variates belong to that class than to 
any other, the mode is the most typical of all the variates of 
a distribution. If any one variate is to be selected as descriptive 
of the data the modal variate should be that variate. 
The mode is accordingly said to define the type of the distribution. 
The significance of the mode as a type depends, of 
course, on the relative preponderance of its frequency. Thus 
the frequency of height 68 in the case of the student distribution 
of page 24 is 126 and the combined frequency of the 
classes near the modal class is a large percent of the total frequency. 
In the beef cattle prices of page 25 the modal class 
has a frequency of 43 and there is not as rapid falling off in the 
frequency on either side of this class as is shown by the 
height data. Hence in the price data the mode does not have as 
great significance as it does in the height data. Data showing 
a strong tendency to concentrate about the mode is said 
to be highly stable or true to type. Measures of trueness 
to type are discussed in the following chapter. 

* By permission of Central Union Telephone Company, Columbus, 
Main Exchange. 

** Same, North Exchange. 

The position of the mode depends only on the values of a 
few variates so that the mode like the median gives little infor- 
mation of the extremes of the range. 

The mode cannot be accurately determined by a simple 
process of arithmetic as can the median and the mean. 

The mode being the predominating value, the type, the fash- 
ion, it is what is ordinarily in the popular mind when an average 
is spoken of. The statement that the average person spends one- 
third of his income in rent is most likely to mean that more per- 
sons spend about that per cent than any other per cent. 

Exercises. 

36. Determine the modal class for each frequency distribution of 
Chapter III. 

37. Show that the concept of mode does not apply to a curve of the 
historical type. 



CHAPTER V. 
THE FORM OF A DISTRIBUTION. 

Dispersion. It is stated in the preceding chapter that 
the significance of the mode as a representative of the data de- 
pends on the extent to which the data conforms to the mode as a 
type. That is, if the sum of the frequencies near the mode is a 
relatively large per cent of the total frequency the modal devia- 
tion is highly typical and the data is not highly variable. The 
word variable is used because, if in the data a certain type does 
not predominate, different samples will have a tendency to show 
widely differing distributions. If, to illustrate, the modal fre- 
quency of a second distribution of the heights of 750 students is 
only 95 with a similar reduction in the other larger frequencies, 
this second distribution is not so true to the type expressed by the 
mode as is the first distribution. 

To repeat, a distribution with small frequencies at the 
ends of the ranges and with the frequencies concentrated at 
a point is said to be true to type, to be highly stable. Let us 
investigate various methods of measuring the extent to which 
the data is scattered or dispersed about the class of concen- 
tration. 

Measures of Dispersion. Because the breadth of the 
range depends on the usually uncertain data at the extremes it 
does not furnish a reliable measure of the extent to which the 
data is spread out. As given on page 24 the range of student 
heights is 14 inches; the inclusion of a single student of height 58 
inches would increase the range by more than twenty percent. 

We have seen that in theory the dispersion should be meas- 
ured from the mode but in practical statistical work the mean, 
median and mode differ so little in position that it is ordinarily 
permissible to measure the dispersion from the mean. 

The sum of the deviations about the mean is useless as a 
measure of dispersion because, as was proved on page 35, this 
sum is zero regardless of the spread or dispersion of the dis- 
tribution. 

Mean Deviation. Since the object in measuring dispersion 
is to determine the divergences of the variates from an 
average it is the amount of a divergence that counts and not 
its direction. Hence a logical measure of dispersion is obtained 
by adding the divergences, all counted positive, and then dividing 
the sum by the total frequency. This gives the mean 
deviation. 

The form for the computation of the mean deviation is the 
same as for the arithmetic mean except that all deviations are 
measured from the mean, median or mode, whichever is chosen 
for the origin, and all negative signs are disregarded. 
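
The form of computation just described may be sketched for the student heights. The function name `mean_deviation` and its `about` parameter are illustrative; the base may be the mean, the median or the mode.

```python
# Student height distribution (inches and frequencies).
heights = list(range(61, 75))
freqs = [2, 10, 11, 38, 57, 93, 106, 126, 109, 87, 75, 23, 9, 4]


def mean_deviation(values, freqs, about):
    """Mean of the divergences from `about`, all counted positive."""
    total = sum(freqs)
    return sum(abs(v - about) * f for v, f in zip(values, freqs)) / total


mean = sum(h * f for h, f in zip(heights, freqs)) / sum(freqs)

md_about_mean = mean_deviation(heights, freqs, about=mean)
md_about_mode = mean_deviation(heights, freqs, about=68)

print(round(md_about_mean, 2), round(md_about_mode, 2))
```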

Exercises. 

1. Compute the mean deviation from the arithmetic mean of the 
Student Height Data of page 24. 

Referring to the computation for the arithmetic mean on page 33, 
let us add a column obtained by taking the difference between the mean 
and each deviation and then multiply these differences by the respective 
frequencies and add the resulting products. This sum is then divided 
by the total frequency in order to obtain the mean deviation. We thus 
have : 

Computation of the Mean Deviation. 

Class No.    Diff.    Freq.      Prod.

    1         6.9        2        13.8
    2         5.9       10        59.0
    3         4.9       11        53.9
    4         3.9       38       148.2
    5         2.9       57       165.3
    6         1.9       93       176.7
    7         0.9      106        95.4
    8         0.1      126        12.6
    9         1.1      109       119.9
   10         2.1       87       182.7
   11         3.1       75       232.5
   12         4.1       23        94.3
   13         5.1        9        45.9
   14         6.1        4        24.4

Total                  750     1,424.6

Mean deviation = 1,424.6 / 750 = 1.9 classes. 
Since each class interval is one inch the mean deviation is 1.9 inches. 

Table III. 

2. Compute the mean deviation about the arithmetic mean of the 
price data of page 25. 

3. Compute the mean deviation about the median of the price data 
of page 25 and compare the result with that of Exercise 2. 

4. Compute the mean deviation about the arithmetic mean of the 
precipitation data of Exercise 5, Chapter I, and of the temperature data 
of Exercise 6 of the same chapter. 

5. From the frequency tables of Exercises 2 and 3 of Chapter III 
compute the mean deviation of monthly precipitation and of monthly 
temperature. 

For purposes of comparing the stability of different distributions 
it is desirable to divide the mean deviation by the mean or 
median, whichever is used. When this is done the mean deviation 
is expressed as a fraction of the base average. For instance, it 
seems reasonable to say that a mean deviation of 0.3 with an 
arithmetic mean of 20 has the same significance as a mean deviation 
of 0.9 based on an arithmetic mean of 60. 

Exercises. 

5. Compare the dispersions in Exercises 1, 2, 3, 4. 

Because, as is presently proved, the mean deviation is least 
when taken about the median it is theoretically best to compute 
the mean deviation about that average. When so done there is a 
certain degree of standardization which is not attained with any 
other average as a base, but the point is not of great practical im- 
portance unless the median and the arithmetic mean differ 
markedly. 

Proof that the mean deviation is smallest when taken 
about the median* 

Let P be a point on the line S-T between the points A and B. 
The sum of the deviations of P from A and B is, without regard to 
the sign of the negative deviation PA, PB + PA, and this sum is equal 
to AB. If P should lie without the segment AB the sum of the two 
deviations would be greater than AB. Likewise the sum of the distances 
of P from any other two points C and D is least when P lies between 
them. Hence the total sum of deviations of P from any number of 
points is least when there are as many points on one side of P as on 
the other; that is, when P is the median of the points. 
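
The proof may be illustrated numerically. In the sketch below every student is treated as standing at the mid-value of his class, and the sum of the deviations is tried for a range of positions of P at intervals of a tenth of an inch; the grid of trial positions is an arbitrary choice made for the illustration.

```python
# Student height distribution (inches and frequencies).
heights = list(range(61, 75))
freqs = [2, 10, 11, 38, 57, 93, 106, 126, 109, 87, 75, 23, 9, 4]


def total_abs_deviation(point):
    """Sum of the deviations from `point`, all counted positive."""
    return sum(abs(h - point) * f for h, f in zip(heights, freqs))


# Trial positions of P from 61 to 74 inches in steps of one tenth.
candidates = [61 + k / 10 for k in range(131)]
best = min(candidates, key=total_abs_deviation)

mean = sum(h * f for h, f in zip(heights, freqs)) / sum(freqs)

# The least sum occurs at the middle variate, not at the arithmetic mean.
print(best, total_abs_deviation(best), round(total_abs_deviation(mean), 1))
```

The trial position giving the least sum is 68, the class of the 375th and 376th students when counted in order of height, and the sum there is less than the sum taken about the mean.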

S ---- A ---- C ---- E ---- P ---- B ---- D ---- F ---- T

Exercises. 

6. According to the measure supplied by the mean deviation which 
is the more variable, the monthly mean temperature or the monthly 
mean precipitation at Columbus? 

7. From the data of heights on page 24 and the data of weights 
of Exercise 13, Chapter III, determine which is the more variable, 
student height or student weight. 

Statistical Properties of the Mean Deviation. The mean 
deviation as a measure of dispersion has all the properties of a 
mean — it takes all the variates into account ; it takes each variate 
according to its size and consequently may give more prominence 
to extreme variates than their statistical importance may warrant ; 
it is computed by a simple process of arithmetic. Because in 
forming it only the numerical values of the deviations are used 
and all distinctions between positive and negative deviations are 
disregarded the mean deviation is not well adapted to certain 
statistical purposes for which the standard deviation, to be next 
discussed, is preeminently fitted. 

Altogether the mean deviation is an index of dispersion of 
practical importance and should ordinarily be used either alone 
or in connection with other measures. 

The Standard Deviation. The mathematically simplest 
device for eliminating negative signs is by squaring the terms. 
Hence if the difference between each deviation and the mean 
be squared, the squares added and the resulting 
sum divided by the total frequency, the mean squared deviation 
thus obtained is a measure of dispersion which is arithmetically 
more convenient than is the mean deviation. 

The computation of the mean squared deviation differs from 
the computation of the mean deviation, which is illustrated under 
Exercise 1, only in that the deviation differences are squared before 
multiplication by the frequencies. It is of course possible 
to compute directly from the data without using the frequency 
table but only a slight error is introduced by the combining of 
the actual values into reasonably narrow classes and much labor 
is ordinarily saved because only one multiplication is then required 
for each class instead of for each individual variate as is 
necessary if the frequency distribution is not used. 

Exercises. 

6. Determine the mean squared deviation about the arithmetic 
mean of the data of Student Heights. 

7. Do the same for the Prices of Top Beef Cattle. 

8. Do the same for Monthly Precipitation at the Columbus Station. 

9. Do the same for Monthly Temperatures at the Columbus Station. 

The above method of computing the mean squared deviation 
involves fractional differences in the deviations. By the follow- 
ing modification fractions can be avoided. 

Short Rule for the Mean Squared Deviation. Select an 
integral deviation near the actual arithmetic mean and find the 
difference between each deviation and this selected deviation. 
Square each of the differences so obtained, multiply by the cor- 
responding frequency, add and divide by the total frequency. 
The result is the mean squared deviation from the selected value. 
To obtain the mean squared deviation from the arithmetic mean 
all that is necessary is to subtract from the value just com- 
puted the square of the difference between the true arithmetic 
mean and the selected integral value. If the mean squared 
deviation about the actual arithmetic mean is denoted by 
the square of the Greek letter $\sigma$ (sigma), and the mean squared 
deviation about any other point by the same symbol written with a 
prime, $\sigma'^2$, we have, on recalling that the letter d is used to 
denote the deviation of the arithmetic mean from the origin, the 
following formula: 

$$\sigma^2 = \sigma'^2 - d^2.$$

To prove this formula let the deviations from the original 
origin be denoted by X and the deviations from the arithmetic 
mean by x and let the distance of the mean from the original 
origin be denoted by d. Then $X = x + d$ for each individual in 
the distribution and $x = X - d$. 

The mean squared deviation is obtained by squaring each x, adding, 
and dividing by the total frequency N. Performing these operations 
we have 

$$\sigma^2 = \frac{(X_1 - d)^2 + (X_2 - d)^2 + \cdots}{N} = \frac{X_1^2 + X_2^2 + \cdots}{N} - \frac{2d\,(X_1 + X_2 + \cdots)}{N} + \frac{N d^2}{N}$$

$$= \sigma'^2 - \frac{2 N d^2}{N} + \frac{N d^2}{N} = \sigma'^2 - \frac{N d^2}{N} = \sigma'^2 - d^2,$$

since $x_1 + x_2 + \cdots$ is zero by the Theorem of page 35, so that 
$X_1 + X_2 + \cdots = N d$. 

Exercises. 

10. Recompute by the shortened method the mean squared deviation
about the arithmetic mean of the Student Heights.

Let us take the assumed origin at class 68. The mean
squared deviation is then obtained by the following computation:

Class.   Dev.   Dev. squared.   Freq.   Prod.
   1       7         49            2       98
   2       6         36           10      360
   3       5         25           11      275
   4       4         16           38      608
   5       3          9           57      513
   6       2          4           93      372
   7       1          1          106      106
   8       0          0          126        0
   9       1          1          109      109
  10       2          4           87      348
  11       3          9           75      675
  12       4         16           23      368
  13       5         25            9      225
  14       6         36            4      144
                                 ----    -----
                                  750     4201

4201 ÷ 750 = 5.60

d = 68 − 67.9 = 0.1;  d² = 0.01.
Therefore σ² = 5.60 − 0.01 = 5.59
and σ = 2.36.

Table IV.
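
The arithmetic of Table IV can be checked mechanically. The following is a minimal sketch in Python, not part of the original computation; the function names are illustrative only, and the classes are assumed to be numbered 1 to 14 so that class 8 stands for the 68-inch class taken as the assumed origin.

```python
import math

# Frequencies of the student heights, classes 1 to 14, as in Table IV.
FREQ = [2, 10, 11, 38, 57, 93, 106, 126, 109, 87, 75, 23, 9, 4]
CLASSES = list(range(1, 15))


def mean(values, freqs):
    """Arithmetic mean of a frequency distribution."""
    return sum(v * f for v, f in zip(values, freqs)) / sum(freqs)


def mean_squared_deviation(values, freqs, origin):
    """Mean squared deviation about an arbitrary origin."""
    n = sum(freqs)
    return sum(f * (v - origin) ** 2 for v, f in zip(values, freqs)) / n


def short_rule(values, freqs, assumed_origin):
    """Short rule: sigma^2 = sigma'^2 - d^2, with d = assumed origin - mean."""
    d = assumed_origin - mean(values, freqs)
    return mean_squared_deviation(values, freqs, assumed_origin) - d ** 2


# Direct method (fractional deviations) against the short rule (integral deviations).
direct = mean_squared_deviation(CLASSES, FREQ, mean(CLASSES, FREQ))
short = short_rule(CLASSES, FREQ, 8)
print(round(direct, 2), round(short, 2), round(math.sqrt(short), 2))
```

Both methods give the same mean squared deviation; the short rule merely postpones the fractional work to the single correction d².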

11. Using the shortened method compute the mean deviation for
the Monthly Precipitation data.

The Standard Deviation. The square root of the mean
squared deviation about the arithmetic mean is called the
Standard Deviation.

From the formula σ² = σ'² − d² it is seen that any other
mean squared deviation always exceeds the square of the standard
deviation by the square of the difference between the assumed
origin and the true arithmetic mean.

This gives a certain practical and theoretical preference to 
the standard deviation over that of any other mean squared 
deviation. For this reason, and because certain other computa- 
tions are rendered simpler by so doing, the mean squared devia- 
tion about any other value than the arithmetic mean is seldom 
computed even tho the idea of trueness of type centers about the 
mode. Since the mean and the mode rarely differ by more than 
a small amount the square of this difference will be relatively 
still smaller and as a result, the difference between the square of 
the standard deviation and the mean squared deviation about the 
mode is ordinarily negligible. 

Properties of the Standard Deviation. Since a small 
value for the standard deviation can arise only when the variates 
are closely concentrated about the mean or mode and since a 
large value must be due to a relatively high frequency of the 
variates near the extremes of the distribution, the standard devia- 
tion is a measure of the dispersion of the data. Because the 
effect of squaring is to diminish the importance of the smaller 
values and to exaggerate the importance of the larger values a 
small value for the standard deviation shows conclusively that the 
data is highly true to type and stable, while on the other hand 
a large value may to some extent be due to the presence of the 
larger frequencies of the extreme variates and hence not altogether
significant. But even with this qualification the standard
deviation is a thoroly practicable and reliable index of the dis- 
persion of the data. 

Exercises. 

12. Discuss the comparative variabilities of the distributions for 
which the standard deviations have been computed in the preceding 
exercises of this chapter. 

13. Does a standard deviation of 2.4 inches for height denote a smaller
variability than a standard deviation of 15 pounds for weight?

The Coefficient of Variability. As in the case of the
mean deviation the significance of a value for the standard
deviation depends on the size of the variates. A variation of 10
feet in a measurement of five miles is of the same degree of
accuracy as a variation of 2 feet in one mile.

It is, therefore, reasonable to divide the standard deviation
by the mean in order to express it as a fraction of the
size of the variates. This quotient is ordinarily quite small,
so that it is usual to multiply it by 100. The resulting
coefficient, 100 times the standard deviation divided by the
mean, is called the coefficient of variability.
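
The definition can be put in a few lines of Python. This is a sketch only; the function name and the constant FEET_PER_MILE are illustrative, and the two "measurements" are the ones of the example above, treated as a standard deviation and a mean.

```python
def coefficient_of_variability(standard_deviation, mean):
    """100 times the standard deviation divided by the mean."""
    return 100.0 * standard_deviation / mean


FEET_PER_MILE = 5280

# A variation of 10 feet in five miles and of 2 feet in one mile.
ten_feet_in_five_miles = coefficient_of_variability(10, 5 * FEET_PER_MILE)
two_feet_in_one_mile = coefficient_of_variability(2, 1 * FEET_PER_MILE)
print(ten_feet_in_five_miles, two_feet_in_one_mile)
```

The two coefficients agree, which is the sense in which the two variations are "of the same degree of accuracy."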

Exercises. 

14. Compare the value of the coefficient of variability for height 
with that for weight as shown by the data of the preceding chapter. 

15. Discuss the comparative variabilities of March and July tem- 
peratures as recorded at the Columbus Station. 

16. Apply the coefficient of variability to determine which is the 
more variable, Columbus monthly temperature or Columbus monthly 
precipitation. Compare the variability of annual temperatures with that 
of precipitation. 

17. Compare the value of the coefficient of variability with that of 
the quotient of the mean deviation by the arithmetic mean. 

18. Discuss the comparative practical usefulness of the two indices 
of variability. 

The Quartiles as Measures of Dispersion. The distance 
from the median to the third quartile is the interval that in- 
cludes half the frequencies on the right of the median. The 
distance from the median to the first quartile is the interval 
that includes half the frequencies on the other side of the 
median. Now if these distances are relatively large it must
mean that the frequencies at the center are not large in comparison
with the total frequency. Conversely, if the first and the third
quartiles are close together the distribution must be closely
concentrated about the median; must be highly typical; must
show a low degree of variability, because in every case one-half
the total frequency is included between these two quartiles
and if the interval is narrow the ordinates must be tall, that is
the frequencies in the center must be predominatingly large,
in order to include half the total frequency. If the data has
a flat frequency curve so that the degree of variability is large
and the trueness to type small the two quartiles will be comparatively
far apart.

Ordinarily the distance between the first quartile and 
the median is approximately equal to the distance from the 
median to the third quartile so that the distance from the 
median to the third quartile is taken as the index of disper- 
sion of the distribution. This distance is called the probable 
deviation. 

Since half the total number of frequencies are included 
between the two quartiles the chances are even that an in- 
dividual of the group, selected at random, will have a deviation 
lying between the quartile deviations. In other words, the 
chances are even that an individual selected at random from 
the group will have a deviation numerically less than the prob- 
able deviation. If in one group of 750 students, for instance, 
it is an even bet that a student selected at random has a height 
between 66 and 70 inches and in a second group the range for 
even chances is from 67 to 69, the second group is said to be 
the more true to type. 

Formula for the Probable Deviation. The probable 
deviation can always be found by the simple process of locating 
the quartiles. It is proved in the following chapter that for a 
certain special, tho very frequently occurring, form of distri- 
bution the probable deviation is equal to the standard deviation 
multiplied by a constant. 

In symbols, we have P. E. = 0.6745 σ, where the symbol
P. E., inherited from the theory of errors developed by Gauss,
denotes the probable deviation.

If the distribution is markedly unsymmetrical the above 
formula may not hold accurately and there are symmetrical 
distributions for which it does not hold exactly. But extreme 
accuracy in the matter of an index of dispersion is not necessary 
or desirable; the formula is generally used regardless of the 
form of the distribution. 

Exercises. 

19. Compute the probable deviation of the Student Heights.

20. Compute the probable deviation of the Student Weights.

21. Compute the probable deviation of the Top Beef Cattle Prices.

Probable Deviation of the Arithmetic Mean. The arithmetic
mean in the Student Height data of page 24 is 67.9
inches. The mean height of a second group of 750 students 
from the same student population would most likely not differ 
greatly from 67.9 but it is not at all likely that it would be 
exactly the same as that of the first group. Let group after 
group be taken and the value of the mean computed for each
group. The values of these means would themselves form 
a frequency distribution from which a mean and probable devia- 
tion could be obtained. 

Now if the student data is highly typical and stable the 
variation in the successive means will be within a small range 
and hence the probable deviation of the means will be relatively 
small. Let us assume the value which we have obtained by 
actual observation, namely 67.9, is the best estimate of the true 
mean of the height of all such students ; that is, that the devia- 
tion of greatest frequency in the frequency distribution of 
means is 67.9. Then the probable deviation in this distribution 
will be the probable deviation of the mean. It can be proved that
this probable deviation is obtained in accordance with the formula,

$$\text{P. E. of mean} = 0.6745\,\frac{\sigma}{\sqrt{N}}.$$

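
The process of taking "group after group" can be imitated by simulation. The sketch below is an illustration under stated assumptions and not the proof referred to: it assumes a normal population with mean 67.9 and standard deviation 2.36, uses groups of 100 rather than 750 to keep the run short, and measures the probable deviation of the means as half the distance between their quartiles. All names are hypothetical.

```python
import math
import random
import statistics


def probable_deviation_of_means(pop_mean, pop_sd, group_size, groups, seed=0):
    """Draw many groups, compute each mean, and return the semi-interquartile
    range of the distribution of means."""
    rng = random.Random(seed)
    means = [
        statistics.fmean(rng.gauss(pop_mean, pop_sd) for _ in range(group_size))
        for _ in range(groups)
    ]
    q1, _median, q3 = statistics.quantiles(means, n=4)
    return (q3 - q1) / 2


observed = probable_deviation_of_means(67.9, 2.36, group_size=100, groups=2000)
predicted = 0.6745 * 2.36 / math.sqrt(100)
print(observed, predicted)
```

The observed and the predicted values should agree to within the accidental variation of the simulation.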
Exercises. 

22. Compute the probable deviation of the mean Monthly Precipita- 
tion for the Columbus Station. 

23. Compute the probable deviation for the Monthly Top Beef Cattle
Prices of page 25.

Probable Deviation of the Standard Deviation. The
probable deviation of the standard deviation may be explained
by a process of reasoning similar to that for the probable
deviation of the mean. The formula for this probable deviation is:

$$\text{P. E. of standard deviation} = 0.6745\,\frac{\sigma}{\sqrt{2N}}.$$
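
As a hedged illustration of the two formulas, the following Python sketch evaluates them for a group of 750 with a standard deviation of 2.36, the figures quoted in this book for the student heights. The function names are illustrative only.

```python
import math


def probable_error_of_mean(sd, n):
    """0.6745 * sigma / sqrt(N)."""
    return 0.6745 * sd / math.sqrt(n)


def probable_error_of_sd(sd, n):
    """0.6745 * sigma / sqrt(2N)."""
    return 0.6745 * sd / math.sqrt(2 * n)


pe_mean = probable_error_of_mean(2.36, 750)
pe_sd = probable_error_of_sd(2.36, 750)
print(round(pe_mean, 4), round(pe_sd, 4))
```

The second value is always the first divided by the square root of 2, whatever the data.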

Exercises. 

24. Compute the probable deviation of the standard deviation of 
Student Heights. 

25. Compute the probable deviation of the standard deviation of
Student Weights.

26. Which is the more variable, the standard deviation of Student
Heights or of Weights?

Statistical Significance of the Probable Deviation. The
statistical application of the probable deviation may be illustrated
by the following questions: The mean height of a group
of students is 67.9 inches with a probable deviation of 1.78 inches.
The height of a student taken at random from a second group
is 72 inches. What is to be concluded? That the two groups
are taken from essentially the same populations or that they
are taken from distinctly different populations? That is,
how many times may a deviation exceed the probable deviation
and still be assumed to come from the same material? It must
be apparent that this is a fundamental question in statistical
analysis. Further discussion of it is deferred to the following
chapter.

The Deciles as Measures of Dispersion. The position 
of the deciles shows the spread of the variates in the distribu- 
tion. If the deciles near the middle of the distribution are 
close together and the deciles near the beginning and the end 
of the ranges are far apart the distribution is highly variable 
and not true to type. Because there are nine decile positions 
to observe in a distribution the decile is not so simple a measure 
of dispersion as is the quartile or standard deviation, tho this 
very fact of greater detail may in some cases be of advantage. 

Exercises. 

27. By the use of the deciles compare the variability of monthly 
precipitation at the Columbus Station with that of monthly temperatures 
at the same station. 

Symmetrical and Asymmetrical Distributions. The
curve of student heights is essentially of the same shape to the
right of the highest point as it is to the left. It is a symmetrical
curve. (Fig. VI.) Statistically the fact of symmetry means
in this case that there is no tendency for the students to be either
tall or short; that there is no selection between the tall and the
short; that the chances for a tall person to belong to the
student group are equally as good as those of a short person;
that there is absolutely no connection between being a member
of this student group and being tall or being short.

Fig. VI. A Symmetrical Curve.

On the other hand the curve of height of the members
of a police force would have a longer range to the right
than to the left because extremely short persons are excluded.
The curve in this case is said to be asymmetrical. Asymmetry
in a curve denotes the presence of selection in the data; of a
dependence; of an expressed preference for certain values of the
attribute.




Fig. VII. An Asymmetrical or Skew Curve. 



Exercises. 

28. Examine each frequency curve of Chapter III for symmetry and 
discuss the significance of each case of asymmetry. 

The Position of the Averages and Asymmetry. In the 
symmetrical curve the mean, median and mode coincide. The 
cutting off of the range to the left tends to move the mean 
to the right because the longer deviations are to the right, 
and it has been seen that the mean is most affected by the 
longer or extreme deviations. This places the median at the 
left of the mean. The mode will tend to be moved to the left of
the median, both because the mean is moved to the right and
because the shortening of the left range heaps up the
frequencies within the left half. The
result is that the three averages are then in the order — mode, 
median, mean. It has been verified experimentally that for
moderately asymmetrical distributions the distance of the median
from the mean is about one-third the distance of the mean
from the mode.

Skewness. An asymmetrical curve is said to be skew. 
Skewness is positive when the longer range is to the right 
and negative when the longer range is to the left. 

Measures of Skewness. Since the mode and mean are 
separated to an extent depending on the degree of skewness 
present, a logical measure of skewness is the difference between 
the mean and the mode. Because a large difference between the
positions of the mean and the mode in widely spread-out data
may not be so significant as a smaller difference in highly concentrated
data it is advisable to divide this difference by the
standard deviation. Hence we have,

$$\text{Skewness} = \frac{\text{Mean} - \text{Mode}}{\text{Standard Deviation}}.$$



Exercises. 

29. Compute the skewness of the following data of incomes:

Estimated Distribution of Income among the Single Women of
Continental United States in 1890. (King, Wealth and Income, p. 224)

Class            No. in Thousands.
0-$200                  10
$200-$300               70
$300-$400              560
$400-$500              530
$500-$600              280
$600-$700              150
$700-$800              120
$800-$900               37
$900-$1000              22
$1100-$1200             12
$1200-$1300              8
$1300-$1400              5

30. Show that the above formula for skewness correctly indicates 
the sign of the skewness. 

A Second Measure of Skewness is obtained as follows: 
Any measure of skewness must take into account the distinction 
between positive and negative deviations. The total sum of
deviations from the mean is zero regardless of the form of
the distribution; the standard deviation involves the deviations
as squares and hence obliterates the distinction between positive
and negative deviations. The mean cubed deviation, however,
will serve as a measure of skewness. The longer deviations 
to the right, if the skewness is positive, will be more powerfully 
affected by the operation of cubing than will the shorter devia- 
tions to the left and hence the total sum of cubed deviations 
will be positive. It is well to extract the cube root of the 
mean cubed deviation and then in order to express the skewness 
as a fraction of the spread of the distribution to divide the result 
by the standard deviation. 
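
The second measure may be sketched in Python as follows. The sketch assumes a frequency table given as two lists; the function name and the small illustrative tables in the tests are hypothetical and are not data of this book.

```python
import math


def second_skewness(values, freqs):
    """Cube root of the mean cubed deviation, divided by the standard deviation."""
    n = sum(freqs)
    mean = sum(v * f for v, f in zip(values, freqs)) / n
    m2 = sum(f * (v - mean) ** 2 for v, f in zip(values, freqs)) / n
    m3 = sum(f * (v - mean) ** 3 for v, f in zip(values, freqs)) / n
    # Real cube root that keeps the sign of the mean cubed deviation.
    cube_root = math.copysign(abs(m3) ** (1.0 / 3.0), m3)
    return cube_root / math.sqrt(m2)


# A distribution with the longer range to the right gives a positive value.
print(second_skewness([1, 2, 3, 10], [5, 3, 1, 1]))
```

Because the cube preserves the sign of each deviation, the measure is zero for a symmetrical table and changes sign when the table is reversed.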

Exercises. 

31. From the computation form of Exercise 1 compute, in accord- 
ance with the second method, the skewness for the student height distri- 
bution. 

32. Do the same for the distribution of incomes. 



CHAPTER VI. 
THE NORMAL PROBABILITY CURVE. 

The Equation of a Frequency Curve. As discussed in
Chapter II, a smoothed curve is a graphic estimate of what
would be the course of the data if it could be freed from accidental
variations. The smooth curve is therefore the geometric
representation of a law of connection or variation. It shows, 
for instance, the variation of temperature with the seasons; the 
tendency for precipitation to depend on the month of the year; 
the most likely percent of students at each height. 

The presence of an underlying law of connection in the 
data implies the presence of an algebraic law connecting the 
X and the y coordinates. The algebraic statement of the law 
expressing y in terms of x is called the equation of the curve. 

If the equation is given, the ordinate can be computed for 
any abscissa and hence the curve can be located by plotting a 
sufficient number of computed points. 

In some distributions it is possible to discover a law of 
connection directly from the data, and then without an extended 
computation to translate this law into the proper algebraic form. 
We shall discuss in this chapter the equation of only 
one type of curve — the normal curve. This form of curve 
is suited to the representation of a large class of distributions. 
And the theory of the normal curve can be made use of in 
the determination of the probable deviation and in the dis- 
cussion of certain other properties even for a distribution to 
which it does not apply with sufficient accuracy to be adopted 
as the form of the smoothed curve. 

Statistical Theory of the Normal Curve. The height of 
a person is the resultant sum of a large number of elements 
such as the length of certain bones, the widths of cartilages, the 
erectness of posture. And, in general, any statistical data can 
be analyzed into elemental components. Whenever these ele- 
mental values are relatively small in comparison with the result- 
ant values and at the same time each element is equally likely to 
take any value within a small range, then the resultant data is
said to be normally distributed.

With an absence of selection, as is assumed, it is reasonable
to conclude that the resulting distribution will be symmetrical. 
And it is also apparent, after some consideration, that the fre- 
quencies at the center will be high and those at the ends of the 
range very small. It may be noted that in order to have a nor- 
mal distribution it is not at all necessary that it be possible to 
actually compute the values of the elemental factors; it is only 
their existence under the above assumptions that is predicated. 
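
The statistical theory just described can be imitated by a small experiment. The Python sketch below is an illustration under assumptions of its own: each resultant is taken as the sum of 30 elements, each element equally likely to take any value between −1 and +1, and 10,000 resultants are formed. It then counts what fraction of the resultants lie within 0.6745 times the standard deviation of their mean, which for a normal distribution should be about one-half.

```python
import random
import statistics


def simulate_resultants(elements=30, individuals=10000, seed=1):
    """Each resultant is the sum of many small, equally likely elements."""
    rng = random.Random(seed)
    return [
        sum(rng.uniform(-1.0, 1.0) for _ in range(elements))
        for _ in range(individuals)
    ]


resultants = simulate_resultants()
mean = statistics.fmean(resultants)
sd = statistics.pstdev(resultants)
# Fraction of resultants whose deviation is numerically less than 0.6745 sigma.
within = sum(abs(r - mean) < 0.6745 * sd for r in resultants) / len(resultants)
print(round(mean, 3), round(sd, 3), round(within, 3))
```

Note that the values of the single elements are never inspected; only their existence under the stated assumptions is used.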

The Equation of the Normal Curve. It can be mathematically
demonstrated that the equation of the Normal Curve is,

$$y = \frac{N}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\frac{x^2}{\sigma^2}}$$

where N is the total frequency of the distribution; σ, the standard
deviation; π, the well known circle constant 3.14159; and e
is a constant which is numerically equal to 2.71828. In this form
of the equation x is measured from the arithmetic mean as
origin.
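
For readers who wish to evaluate the equation directly, a minimal Python sketch follows. The function name is illustrative; the figures N = 750 and σ = 2.36 are those quoted in this book for the student heights, and the second value of σ is chosen arbitrarily for comparison.

```python
import math


def normal_ordinate(x, total_frequency, sigma):
    """y = N / (sigma * sqrt(2 pi)) * exp(-x^2 / (2 sigma^2)), x measured from the mean."""
    coefficient = total_frequency / (sigma * math.sqrt(2 * math.pi))
    return coefficient * math.exp(-0.5 * (x / sigma) ** 2)


# Ordinate at the mean, and ordinates at a deviation of 8 for two values of sigma.
print(round(normal_ordinate(0, 750, 2.36), 1))
print(normal_ordinate(8, 750, 2.36), normal_ordinate(8, 750, 4.0))
```

A larger σ lowers the ordinate at the center and raises it at the greater deviations, so that the curve is flatter.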

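The Graph of the Normal Equation. Let us write the
normal equation in the form,

$$y = \frac{N}{\sigma} \cdot \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}\frac{x^2}{\sigma^2}}$$

and then in the form,

$$y = \frac{N}{\sigma} \cdot Z \quad \text{where} \quad Z = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}\frac{x^2}{\sigma^2}}.$$

The tables of Sheppard give the value of Z for each value
of x/σ from 0.00 to 6.00. Table V serves to illustrate the complete
table.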

Table of Ordinates and Areas of the Normal Curve

x/σ      Z      Areas.       x/σ      Z      Areas.
0.0    0.399    0.000        1.2    0.194    0.385
0.1    0.397    0.040        1.4    0.150    0.419
0.2    0.391    0.079        1.6    0.111    0.445
0.3    0.381    0.118        1.8    0.079    0.464
0.4    0.368    0.155        2.0    0.054    0.477
0.5    0.352    0.191        2.2    0.035    0.486
0.6    0.333    0.226        2.4    0.022    0.492
0.7    0.312    0.258        2.6    0.014    0.495
0.8    0.290    0.288        2.8    0.008    0.497
0.9    0.266    0.316        3.0    0.004    0.499
1.0    0.242    0.341        3.2    0.002    0.499

Table V.

With a table of values of Z at hand the plotting of a normal
curve is a simple matter of arithmetic. Each deviation from
the mean is divided by the standard deviation and then the
table is entered with the quotients, x/σ, and the values of Z
obtained by interpolation. Multiplication of the interpolated values
by the ratio, N/σ, gives the successive values for y.
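
The procedure of entering the table and interpolating may be sketched in Python. The sketch assumes simple straight-line interpolation between adjacent entries of Table V, which is one reasonable reading of "interpolation" here; the function names are illustrative.

```python
import math

# (x/sigma, Z) pairs copied from Table V.
TABLE_V = [
    (0.0, 0.399), (0.1, 0.397), (0.2, 0.391), (0.3, 0.381), (0.4, 0.368),
    (0.5, 0.352), (0.6, 0.333), (0.7, 0.312), (0.8, 0.290), (0.9, 0.266),
    (1.0, 0.242), (1.2, 0.194), (1.4, 0.150), (1.6, 0.111), (1.8, 0.079),
    (2.0, 0.054), (2.2, 0.035), (2.4, 0.022), (2.6, 0.014), (2.8, 0.008),
    (3.0, 0.004), (3.2, 0.002),
]


def z_from_table(t):
    """Z for a given x/sigma by straight-line interpolation in Table V."""
    t = abs(t)  # the curve is symmetrical about the mean
    for (t0, z0), (t1, z1) in zip(TABLE_V, TABLE_V[1:]):
        if t0 <= t <= t1:
            return z0 + (t - t0) / (t1 - t0) * (z1 - z0)
    raise ValueError("x/sigma lies outside the range of the table")


def z_exact(t):
    """Z computed from its equation, for comparison with the table."""
    return math.exp(-0.5 * t * t) / math.sqrt(2 * math.pi)


def ordinate(x, total_frequency, sigma):
    """y = (N / sigma) * Z, with Z taken from the table."""
    return total_frequency / sigma * z_from_table(x / sigma)


print(z_from_table(1.1), z_exact(1.1))
```

The interpolated and the computed values of Z differ only in the third decimal place, which is ample for plotting.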

In the following illustrative plotting of the normal curve 
of the distribution of student heights the ordinates are com- 
puted for the boundaries (as the fractional deviations in the first 
column of Table VI denote) instead of the midpoints of the 
class intervals. This is done, as will be presently explained, in 
order to find the areas under the curve. In the computation 
scheme, the first column is for the deviations; the second for 
the deviations from the mean; the third, the deviations from 
the mean divided by the standard deviation; the fourth, the 
values of Z obtained from the table; and the fifth column shows
the desired values for the ordinates. The sixth and seventh 
columns are explained on page 63. 

Table of Z's and Corresponding Areas for Student Height

                                                        Student
Deviations.     X       X/σ       Z       Y     Areas.  Ht. Areas.
    0.5       −7.4     −3.20    0.002     1.    0.999      749
    1.5       −6.4     −2.77    0.01      3.2   0.997      748
    2.5       −5.4     −2.34    0.03      9.7   0.990      743
    3.5       −4.4     −1.91    0.06     19.5   0.972      729
    4.5       −3.4     −1.47    0.14     45.5   0.929      697
    5.5       −2.4     −1.04    0.23     74.7   0.851      638
    6.5       −1.4     −0.61    0.33    107.1   0.729      547
    7.5       −0.4     −0.17    0.39    126.6   0.567      425
    7.9       −0.0     −0.00    0.40    120.0   0.500      375
    8.5       +0.6     +0.26    0.39    126.6   0.602      452
    9.5       +1.6     +0.69    0.31    100.7   0.745      559
   10.5       +2.6     +1.13    0.21     68.2   0.871      653
   11.5       +3.6     +1.56    0.12     39.0   0.941      706
   12.5       +4.6     +1.99    0.06     19.5   0.977      733
   13.5       +5.6     +2.43    0.02      6.5   0.993      745
   14.5       +6.6     +2.86    0.01      3.2   0.998      749

Table VI.

The computed points are now plotted and the curve drawn.




Fig. VIII. The Normal Curve of Student Heights. 

In Figure VIII the characteristic bell shape of the normal 
curve is seen. The ordinates at the center do not change rapidly. 
As the deviations increase the ordinates first decrease rapidly 
and then more slowly until the curve flattens out so as to be 
almost coincident with the horizontal axis. 

It is mathematically evident from the form of the equation 
of the normal curve that in a distribution with a large standard 
deviation the values for y are relatively small near the center and 
relatively large for the greater deviations. That is, a large value
for σ indicates a flat normal curve.



Exercises.

1. Plot a normal curve for the distribution of weight according to
the data of page 29.

2. Compare the curve obtained in Exercise 1 with the smooth curve
of Chapter III. How closely do they coincide?

Areas under the Normal Curve. For each class the area
under the frequency curve and between the class limits gives
the smoothed class frequencies. In Sheppard's Table* the areas
are given for each value of x/σ from 0.00 to 6.00. Table V
is sufficiently complete for illustrative purposes. The tables give
the areas from x/σ = 0 to each designated value of x/σ. (In
Sheppard's Tables 0.5000000 is added to these areas.) Hence
to determine the class areas from the tabular values appropriate
subtractions are necessary, except in the case of the central area
which is obtained by adding the two central fractional areas. It
must be remembered that the actual area is finally obtained by
multiplying the values just computed by the total frequency,
and not by the total frequency divided by the standard deviation.

The sixth column of Table VI contains, for the frequency 
distribution of student heights, the areas from the origin; the 
class areas, obtained by subtraction or addition, are entered in 
the second column of Table VII. For comparison the original 
frequencies are placed in the third column of the same table. 
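
The subtraction of areas can be sketched in Python. In place of a printed table the sketch computes the areas from the error function of the standard library, and, as in Sheppard's Tables, it adds 0.5 to every area so that a single subtraction serves for every class, the central one included. The mean of 7.9 and standard deviation of 2.36 on the class scale are assumed for illustration, and the function names are hypothetical.

```python
import math


def area_from_center(t):
    """Area under the unit normal curve from x/sigma = 0 to x/sigma = t."""
    return 0.5 * math.erf(t / math.sqrt(2))


def class_areas(boundaries, mean, sigma, total_frequency):
    """Smoothed class frequencies between successive class boundaries."""
    cumulative = [0.5 + area_from_center((b - mean) / sigma) for b in boundaries]
    # The area is multiplied by N, not by N / sigma.
    return [total_frequency * (hi - lo) for lo, hi in zip(cumulative, cumulative[1:])]


boundaries = [k + 0.5 for k in range(0, 15)]  # 0.5, 1.5, ..., 14.5
areas = class_areas(boundaries, 7.9, 2.36, 750)
print([round(a) for a in areas])
```

The areas so obtained fall a little short of the total frequency because the normal curve extends beyond the extreme class boundaries.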

The Adjusted Distribution of Student Heights.

                    Orig.    Pos.    Neg.
Class.   Areas.     Freq.    Dif.    Dif.
   1        1          2      ..       1
   2        5         10      ..       5
   3       14         11       3      ..
   4       32         38      ..       6
   5       59         57       2      ..
   6       91         93      ..       2
   7      122        106      16      ..
   8      127        126       1      ..
   9      107        109      ..       2
  10       94         87       7      ..
  11       53         75      ..      22
  12       27         23       4      ..
  13       12          9       3      ..
  14        4          4      ..      ..
          ---        ---      --      --
          748        750      36      38

Table VII.

* "Tables for Statisticians and Biometricians." Table II.

The goodness of fit of this normal curve is indicated by
the differences of the fourth and fifth columns. The differences
are taken positive when the adjusted values exceed the original
frequencies. The sum of the positive and of the negative differences
shows a fairly close fit, though the size of the individual
differences must also be taken into account in estimating the
closeness of fit.
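
The comparison of the adjusted with the original frequencies may be sketched in Python as follows. The two lists are the second and third columns of Table VII; the function name is illustrative only.

```python
# Adjusted (normal curve) and original frequencies, classes 1 to 14, from Table VII.
adjusted = [1, 5, 14, 32, 59, 91, 122, 127, 107, 94, 53, 27, 12, 4]
original = [2, 10, 11, 38, 57, 93, 106, 126, 109, 87, 75, 23, 9, 4]


def fit_differences(adjusted, original):
    """Sum of positive differences, sum of negative differences, largest difference."""
    positive = sum(a - o for a, o in zip(adjusted, original) if a > o)
    negative = sum(o - a for a, o in zip(adjusted, original) if o > a)
    largest = max(abs(a - o) for a, o in zip(adjusted, original))
    return positive, negative, largest


print(fit_differences(adjusted, original))
```

The third figure is returned because, as remarked above, the size of the individual differences must be considered as well as their sums.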

Exercises. 

3. Test the closeness of fit of the normal curve of student weights 
plotted in Exercise 1. 

4. Compare the closeness of fit of the normal curves of weight and
height. 

Preliminary Determination of Normality. Before fitting
a curve to a given distribution the data should be analyzed to
determine whether the fundamental conditions of normality
(that each value of the data is the sum of a large number of
relatively small elements each one of which is as likely to have
one value as another, etc.) are satisfied. The data should be
plotted and the smooth curve drawn by the methods of Chapter
II. If the one or the other or both of these tests seem to indicate
a normal distribution a normal curve should be fitted.

A more elaborate test for normality is to compute the mean 
cubed and the mean fourth power deviation. Unless the mean 
cubed deviation is very small the distribution possesses too much 
skewness or asymmetry to be closely fitted by a normal curve. 
It can be shown that for a distribution to be normal the mean 
fourth power of the deviations divided by the square of the mean 
squared deviation must be equal to 3. The variation that may 
be allowed from these two arithmetical tests is shown by Tables
XXXVII and XXXVIII of "Tables for Statisticians and Biometricians."
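
The two arithmetical tests may be sketched in Python. The sketch uses the student height frequencies of this book, classes numbered 1 to 14. As an assumption of the sketch, the mean cubed deviation is divided by the cube of the standard deviation so that "very small" can be judged without regard to the unit of measurement; the function name is illustrative.

```python
# Student height frequencies, classes 1 to 14.
FREQ = [2, 10, 11, 38, 57, 93, 106, 126, 109, 87, 75, 23, 9, 4]
CLASSES = list(range(1, 15))


def normality_ratios(values, freqs):
    """Return (mean cubed deviation / sigma^3, mean fourth power deviation / sigma^4)."""
    n = sum(freqs)
    mean = sum(v * f for v, f in zip(values, freqs)) / n
    m2 = sum(f * (v - mean) ** 2 for v, f in zip(values, freqs)) / n
    m3 = sum(f * (v - mean) ** 3 for v, f in zip(values, freqs)) / n
    m4 = sum(f * (v - mean) ** 4 for v, f in zip(values, freqs)) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2


cubed_ratio, fourth_ratio = normality_ratios(CLASSES, FREQ)
print(round(cubed_ratio, 2), round(fourth_ratio, 2))
```

For a normal distribution the first ratio is zero and the second is 3; how far the computed values may depart from these is the question answered by the tables cited.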

However, a practically conclusive test of the appropriate- 
ness of the normal curve is that of comparing the adjusted with 
the original frequencies. 

Exercises. 

5. Discuss the advisability of attempting to fit a normal curve to
the precipitation data of page 7.

6. Is it likely that the frequency distribution of March temperatures
will be more nearly normal than the distribution of temperatures
for all months?

7. Discuss the probable fit of a normal curve in the case of the
top beef cattle prices.

8. Is it likely that a normal curve will fit the income data of page
57?

9. What does a divergence from normality indicate? 

10. What reasons are there for thinking that the distribution of 
grades in a large class of students should be normal? 

11. What reasons might exist for thinking that data of prices might 
not be normally distributed? 

Probable Deviation in a Normal Distribution. The
quartiles divide the two halves of the area into equal parts;
hence, in Table V, the value of x/σ which corresponds to an
area of 0.25 gives the value of the probable deviation. This value
of x/σ is there found equal to 0.6745. Therefore the deviation of
the quartile is 0.6745 times the standard deviation. This demonstrates
the rule for obtaining the probable deviation: multiplying
the standard deviation by 0.6745.
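
The constant 0.6745 can be recovered without a printed table. The Python sketch below, offered as an illustration only, computes the area from the center by the error function of the standard library and searches by repeated halving for the value of x/σ whose area is 0.25. The function names are hypothetical.

```python
import math


def area_from_center(t):
    """Area under the unit normal curve from x/sigma = 0 to x/sigma = t."""
    return 0.5 * math.erf(t / math.sqrt(2))


def deviation_for_area(target, lo=0.0, hi=6.0, steps=60):
    """Value of x/sigma whose area from the center equals target, by bisection."""
    for _ in range(steps):
        mid = (lo + hi) / 2
        if area_from_center(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


print(round(deviation_for_area(0.25), 4))
```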

The formulas for the probable deviation of the arithmetic 
mean and of the standard deviation referred to in the preceding 
chapter are derived on the assumption that the two constants are 
each normally distributed. 

It can be shown mathematically that even when the form 
of distribution is distinctly non-normal the ordinary rules for 
finding the probable deviations hold with an approximation close 
enough for practical purposes, and experimentation with dif- 
ferent forms of distributions bears out the mathematical con- 
clusions. 

Exercises. 

12. What is the deviation corresponding to the ordinate which
marks off three-fourths of the area to the right of the center?

13. What part of the area under the normal curve is included be- 
tween the median and the ordinate with a deviation of two times the 
standard deviation? Three times the standard deviation? Four times the 
standard deviation? 

The results of Exercise 13 show that the occurrence of a 
deviation of three times the standard deviation is highly improb- 
able. That is, a deviation greater than about three times the 
standard deviation must significantly indicate that the measure- 
ment is not that of an individual taken from the same material ; 
it does not belong to the same distribution but to another
distribution which has some conditions different from the first. To
illustrate, the standard deviation of student heights is 2.36
inches and the mean height is 67.9 inches. One would, according
to this theory, be justified in concluding that a person with a
height of 75 inches (67.9 + 3 × 2.36 = 74.98) does not belong
to the student group.
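
How improbable such deviations are under the normal curve can be computed directly. The Python sketch below is an illustration with hypothetical function names; it gives the area between the center and a deviation of a stated multiple of the standard deviation, and the chance that a deviation in either direction exceeds that multiple.

```python
import math


def area_from_center(multiple):
    """Area between the center and a deviation of multiple * sigma."""
    return 0.5 * math.erf(multiple / math.sqrt(2))


def chance_of_exceeding(multiple):
    """Chance that a deviation, in either direction, exceeds multiple * sigma."""
    return 1.0 - 2.0 * area_from_center(multiple)


for multiple in (2, 3, 4):
    print(multiple, round(area_from_center(multiple), 5), chance_of_exceeding(multiple))

# The limit used in the illustration of the text.
limit = 67.9 + 3 * 2.36
print(round(limit, 2))
```

These figures hold for a strictly normal distribution only; for actual data they are to be read as cautionary limits.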

13. Does the theory of Exercise 13 accord with the actual distribution
of the student heights?

14. Does the theory of Exercise 13 accord with the actual distribu- 
tion of the student weights? With the distribution of monthly precipita- 
tion for the Columbus Station? With the distribution of the monthly 
temperatures for the Columbus Station ? 

15. On the basis of the results of Exercises 13 and 14 and on other 
investigations of a similar nature discuss the practical applicability of the 
present theory of the probable deviation. 

While it is not advisable to place implicit confidence in the 
tests furnished by the theory of probable deviations to the ex- 
tent that the results which it indicates are accepted without some 
independent verification, or at least justification, yet when used 
with judgment it is an extremely valuable aid in practical statis- 
tical work. In every case it establishes cautionary limits, as, for 
instance, one would not ordinarily be justified in concluding 
that a variate with a deviation much greater than two or three 
times the standard deviation belonged to the distribution. On
the other hand if a number of measurements of height should 
each consistently exceed those of the student distribution it 
might then be concluded with much certainty that the in- 
dividuals measured were taken from a population distinctly 
different from the student population. And the conclusion 
would be justified even tho the deviations were considerably 
less than two or three times the standard deviation. 



CHAPTER VII. 



THE CORRELATION TABLE. 



From the records of the physical measurements a tabulation 
was made of the heights of the students whose weight was from 
130 to 134 pounds — a weight class which may be denoted by the 
middle weight, 132 pounds — and the following distribution 
obtained:

Height   62  63  64  65  66  67  68  69  70  71  72  73  74
Number    2   5   4  19  18  18  17   8   8   4   3   1   1

The distributions were likewise obtained for each other five- 
pound interval from 102 to 187 pounds. Instead of writing each 
of these distributions separately it is more convenient to write 
them together in one table called, for reasons explained on page 
73, a correlation table. In this way we have Table VIII. 

Correlation Table of Height and Weight.

[Columns: height in inches, classes 61 to 74. Rows: weight in pounds,
five-pound classes denoted by their middle weights, 102 to 187. The
subclass frequencies are illegible in this copy. Totals of the weight
arrays, heights 61 to 74: 2, 10, 11, 38, 57, 93, 106, 126, 109, 87, 75,
23, 9, 4; total frequency 750.]

Table VIII.

The writing of the distributions in this compact tabular
form greatly facilitates the study and comparison of the different
distributions.

Exercises. 

1. Notice that there is a decided increase in weight with an increase
in height; that there are no extremely tall persons in the group who are at
the same time extremely light in weight; that there are practically no
persons who are both short and extremely heavy.

2. Note that there is a closer connection between height and weight 
for the shorter and lighter individuals than for persons with medium 
values of the two characteristics. 

The Construction of a Correlation Table. Let us construct
the correlation table of monthly precipitation and monthly
mean temperatures for the Columbus Station. The data is
given under Exercises 2 and 3 of Chapter III. Let the horizontal
scale refer to temperatures and let each class of this scale have
a width of five degrees. The vertical scale will then refer to
precipitation; let the width of its classes be taken as one-half
inch. The scales are written across the top and down the left
hand margin respectively in order to leave room for the summations
across the bottom and down the right hand margin.
Under this arrangement of the scales y increases in value from
top to bottom and hence the positive direction for y is downward.

In constructing the table it is convenient to rewrite the data
according to classes and at the same time to combine the two
distributions. There is no need for retaining the dates but care
must be taken that the measures from exactly the same months
are written together. This is done by starting with January,
1879, and proceeding with the Januarys and then February,
1879, and so on in order. The temperature figures are written
first in each pair of numbers, and the lower limit is written as
the class number of each class. Thus $\frac{25}{1.5}$ refers to a month with
a mean temperature from 25 to 29 degrees inclusive and with
a precipitation from 1.5 to 1.9 inches inclusive. In this way
there will be built up a table of the following form:

25     40     20     30     25
1.5    4.0    2.0    4.5    3.0

20     20     20     25     25
2.0    3.5    4.0    2.0    3.5, etc.

Next the rulings must be made for the table. The tabulation
proceeds in the following manner: for the first pair of
numbers find the 25 column and drop down this column to the
precipitation class 1.5 and mark a score; then to the 40 column
and down to the 4.0 class and tally; then to column 20 and
precipitation class 2.0; etc.

The diagram of tallies, usually dots, is called the Scatter 
Diagram. 
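
The tallying just described can be sketched in a few lines. The monthly figures below are hypothetical, chosen only to show the procedure, and the names `class_lower_limit` and `months` are introduced here for illustration; the class widths of five degrees and one-half inch are those of the text.

```python
import math
from collections import Counter

def class_lower_limit(value, width):
    """Lower limit of the class of the given width that contains `value`."""
    return math.floor(value / width) * width

# Hypothetical (mean temperature, precipitation) pairs, one pair per month.
months = [(27.3, 1.7), (41.0, 4.2), (22.5, 2.3), (27.9, 1.9)]

# Each month is scored in the subclass named by its two lower class limits.
table = Counter(
    (class_lower_limit(t, 5), class_lower_limit(p, 0.5)) for t, p in months
)

for (t_class, p_class), frequency in sorted(table.items()):
    print(t_class, p_class, frequency)
```

Replacing the tallies by their counts, as the `Counter` does, gives the frequencies of the correlation table directly.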

Correlation Table of Temperature and Precipitation.

Columns: monthly mean temperature in degrees. Rows: precipitation in inches.

         15  20  25  30  35  40  45  50  55  60  65  70  75  Totals
0.5      ..   3  ..   2   2   1   1   4   1   1   1   3   1      20
1.0      ..   2   4   5   2   2  ..   3   3   3   4   4   3      35
1.5       1   2   4   7   8   5   4   6   6   4   5  10   4      66
2.0       2   4   5   5   6   6   6   5   3   3   6   7   7      65
2.5      ..   1   3   5   5   3   3   7   6   5   5  16   2      61
3.0      ..   1   5   7   2   4  ..   5   3   1   5   5   5      43
3.5      ..   1   5   2   2   6   2   2   1   4   3   4   6      38
4.0      ..   1   2   2   5  ..   2   4   2   4   2   5   1      30
4.5      ..   1  ..   6   3   3   1   2   1   2  ..   3   2      24
5.0      ..  ..   2  ..   4   4   1   2   1   1   5   6   1      27
5.5      ..  ..  ..  ..   3   1  ..   1  ..   2   1   3  ..      11
6.0      ..  ..  ..   1  ..   2  ..  ..   1   1  ..   3   1       9
6.5      ..  ..  ..  ..   2  ..  ..  ..  ..  ..  ..   1  ..       3
7.0      ..  ..  ..  ..  ..  ..   1   1  ..   1   1   1   2       7
7.5      ..  ..  ..   1  ..   1  ..  ..  ..   1  ..  ..  ..       3
8.0      ..  ..  ..  ..  ..   1  ..  ..  ..  ..  ..  ..  ..       1
8.5      ..  ..  ..  ..  ..  ..  ..  ..  ..   1   1  ..  ..       2
9.0      ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..      ..
9.5      ..  ..  ..  ..  ..  ..  ..  ..  ..  ..  ..   1  ..       1
Totals    3  16  31  43  44  42  21  44  30  35  39  73  35     456

Table IX.

Table IX, the correlation table, is made from the Scatter 
Diagram by inserting the frequencies in the place of the tallies. 



Exercises. 

3. Do wet months uniformly occur with warm months? Or is there
more of a tendency for wet and cold or cool months to be associated?

4. What may be said as to the tendency for dry and warm months
to be associated? For dry and cool months?

5. Does there seem to be as close a connection between precipitation
and temperature as between height and weight?

6. Is it not possible that the real connection between precipitation
and temperature in this table is obscured by the fact that data for all
four seasons is thrown together? Explain.

Definitions and Symbols. The properties, as height and
weight or temperature and precipitation, are called the attributes
or characteristics.

The horizontal deviations are called the x classes or
deviations, and the vertical, the y classes or deviations. Each
subclass or subgroup thus has a value of x and of y associated
with it. It is convenient to number the x and y classes from
left to right and from top to bottom, respectively, and use these
numbers for class numbers instead of the actual class values.
Thus there are 17 persons with height 66 inches and weight
122 pounds; and 4 months with a mean temperature of from 41
to 45 degrees and a precipitation of from 3.0 to 3.5 inches. In
terms of x and y, the subclass x = 6, y = 5 has a frequency of 17;
the subclass x = 5, y = 6 has a frequency of 4 months.

The columns and rows are spoken of as arrays; the col- 
umns as y-arrays of type x and the rows as x-arrays of type y. 
Or the concrete names of the data may be given to the arrays 
— the weight array of height 67 inches; the height array of 
weight 132 pounds; the precipitation array of temperature 40 
degrees, etc. It should be noted that the weight array of height 
type 67 inches is the distribution with respect to weight of the 
persons having a height of 67 inches; the precipitation array of 
type 40 degrees is the precipitation distribution of the months 
having a mean temperature of 40 degrees. 

A y array of type x and an x array of type y are said to
be arrays of opposite sense. Two y arrays or two x arrays
are arrays of the same sense.

The frequency of a y array is denoted by the symbol $n_x$,
where x is the type of the array. The frequency of an x array
is denoted by the symbol $n_y$, where y is the type. The frequency
of a subclass is denoted by the symbol $n_{xy}$, where x and y are
the deviations of the subclass; that is, the types of its two arrays.
Thus, $n_{61} = 2$; $n_{132} = 93$; $n_{66,142} = 12$, or if the simpler class
numbers are used, $n_{1:} = 2$; $n_{:7} = 93$; $n_{6:9} = 12$. When the latter
form of class numbers is employed it is necessary to distinguish
between x and y class numbers by means of a colon.
Sometimes the distinction between x and y deviations or class
numbers is made by the use of subscripts as $n_{x_1 y_2}$.
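
The three kinds of frequency may be illustrated on a small table. The figures below are hypothetical and are not taken from Table VIII or Table IX; the helper `n_xy` is a name introduced here, and class numbers are counted from 1 as in the text.

```python
# Hypothetical correlation table: table[i][j] is the frequency of the
# subclass with y class i + 1 (row) and x class j + 1 (column).
table = [
    [4, 2, 0],
    [1, 5, 3],
    [0, 2, 6],
]

n_x = [sum(column) for column in zip(*table)]  # frequencies of the y arrays
n_y = [sum(row) for row in table]              # frequencies of the x arrays
total = sum(n_y)                               # total frequency

def n_xy(x, y):
    """Frequency of the subclass with class numbers x and y."""
    return table[y - 1][x - 1]

print(n_x, n_y, total, n_xy(2, 3))
```

The frequencies of the subclasses of any one array add up to the frequency of that array, which is the relation used in Exercise 10.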

Exercises.

7. Write the values of $n_x$ for x = 2, 4, 9 in the precipitation data.

8. Write the values of fv^A «»:« for both the height- weight and the 
precipitation-temperature data. 

9. Practice stating the frequencies of the various arrays and sub- 
groups; e.g. the frequency of the weight array of type 8 (68) is 126. 

10. Note that $n_{1:7} + n_{2:7} + n_{3:7} + \dots + n_{14:7} = n_{:7} = 93$, for the
height-weight data.

11. Write other statements in the form of that of Exercise 10. 

The mean of the vertical column of totals is called the
mean of all the weights, and in general, the mean of all the y's;
and is denoted by the symbol $\bar{y}$. It is the mean of the vertical
deviations of the variates when unclassified with respect to the
horizontal attribute; the mean weight for all heights; the mean
monthly precipitation disregarding temperature; the mean
monthly precipitation for all temperatures taken together.

Likewise, the mean of all the x's is denoted by the symbol $\bar{x}$.

The means of the weight arrays are denoted by the symbols
$\bar{y}_{61}$, $\bar{y}_{62}$, $\bar{y}_{63}$. In general the mean of the y-array of type
x is denoted by the symbol $\bar{y}_x$. The mean of the x-array of
type y is denoted by the symbol $\bar{x}_y$.
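
As a sketch of these definitions, the means of the y arrays and the mean of all the y's are computed below for a small hypothetical table (the same invented figures as might be used for any practice table; they are not the student data). Class numbers serve as the values of y, and the function names are introduced here for illustration.

```python
# Hypothetical table: rows are y classes 1 to 3, columns are x classes 1 to 3.
table = [
    [4, 2, 0],
    [1, 5, 3],
    [0, 2, 6],
]
y_values = [1, 2, 3]

def array_means(table, y_values):
    """Mean of y within each column, that is, within each y array of type x."""
    means = []
    for column in zip(*table):
        n_x = sum(column)  # assumed to be greater than zero for every array
        means.append(sum(y * f for y, f in zip(y_values, column)) / n_x)
    return means

def overall_mean(table, y_values):
    """Mean of all the y's, computed from the column of row totals."""
    n_y = [sum(row) for row in table]
    return sum(y * n for y, n in zip(y_values, n_y)) / sum(n_y)

means = array_means(table, y_values)
mean_y = overall_mean(table, y_values)
print([round(m, 3) for m in means], round(mean_y, 3))
```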

Exercises.

12. From the following data construct the correlation table of top
hog and top beef cattle prices at Chicago.

Chicago Monthly Top Hog Prices.

Year   Jan.   Feb.   Mar.   Apr.   May.   June   July   Aug.  Sept.   Oct.   Nov.   Dec.
1916  $8.10  $8.90 $10.10 $10.05 $10.35 $10.15 $10.25 $11.55 $11.60 $10.35 $10.35 $10.80
1915   7.40   7.25   7.05   7.90   7.95   7.95   8.12   8.05   8.50   8.95   7.75   7.10
1914   8.00   8.90   9.00   8.95   8.67   8.50   9.30  10.20   9.75   9.00   8.25   7.75
1913   7.80   8.70   9.62   9.70   8.85   9.00   9.62   9.40   9.65   9.10   8.30   8.15
1912   6.70   6.57   7.95   8.20   8.05   7.30   8.50   9.00   9.27   9.42   8.30   7.85
1911   8.30   7.90   7.35   6.90   6.50   6.72   7.55   7.95   7.80   6.90   6.72   6.60
1910   9.05  10.00  11.20  11.00   9.35   9.80   9.60   9.70  10.10   9.65   8.70   8.10
1909   6.70   6.95   7.15   7.60   7.55   8.20   8.45   8.32   8.60   8.40   8.45   8.75
1908   4.72   4.70   6.35   6.45   5.90   6.67   7.10   7.10   7.60   7.20   6.40   6.15
1907   7.05   7.25   7.10   6.90   6.65   6.42   6.65   6.72   7.00   7.00   6.32   5.30
1906   5.72   6.42   6.55   6.82   6.67   6.85   7.00   6.75   6.82   6.85   6.50   6.55
1905   5.00   5.12   5.55   5.72   5.65   5.70   6.17   6.45   6.20   5.80   5.25   5.35
1904   5.20   5.30   5.82   5.50   4.95   5.40   5.90   5.80   6.37   6.30   5.25   4.87
1903   7.10   7.65   7.87   7.65   7.15   6.45   6.10   6.20   6.45   6.50   5.50   4.9a
1902   6.85   6.60   6.95   7.50   7.50   7.9)   8.25   7.95   8.20   7.92   6.95   6.80
1901   5.47   5.65   6.20   6.25   6.05   6.30   6.40   6.75   7.37   7.10   6.30   6.90
1900   4.92   5.10   5.55   5.85   5.57   5.42   5.55   5.57   5.70   5.55   5.12   5.10
1899   4.05   4.05   4.00   4.15   4.05   4.00   4.70   5.00   4.90   4.90   4.35   4.45
1898   4.00   4.27   4.17   4.15   4.80   4.50   4.17   4.20   4.15   4.00   3.85   3.75
1897   3.60   3.75   4.25   4.25   4.05   3.65   4.00   4.55   4.65   4.40   3.80   3.60
1896   4.45   4.35   4.35   4.15   3.75   3.60   3.70   3.75   3.50   3.65   3.67   3.65
1895   4.80   4.65   5.30   5.42   4.97   5.10   5.70   5.40   4.65   4.50   3.85   3.75



Chicago Monthly Top Beef Cattle Prices.

Year   Jan.   Feb.   Mar.   Apr.   May.   June   July   Aug.  Sept.   Oct.   Nov.   Dec.
1916  $9.85  $9.75 $10.05 $10.00 $10.90 $11.50 $11.30 $11.50 $11.50 $11.65 $12.40 $13.00
1915   9.70   9.50   9.15   8.90   9.65   9.95  10.40  10.50  10.50  10.60  10.55  11.60
1914   9.50   9.75   9.75  Q.'^S   9.60   9.45  10.00  10.90  11.05  11.00  11.00  11.40
1913   9.50   9.25   9.30  9.2")   9.10   9.20   9.20   9.25   9.50   9.75   9.85  10.25
1912   8.75   9.00   8.85   9.00   9.40   9.60   9.85  10.65  11.00  11.05  11.00  11.25
1911   7.10   7.05   7.35   7.10   6.50   6.75   7.35   8.20   8.25   9.00   9.25   9.35
1910   8.40   8.10   8.85   8.65   8.75   8.85   8.60   8.50   8.50   8.00   7.75   7.5^
1909   7.50   7.15   7.40   7.15   7.30   7.50   7.65   8.00   8.50   9.10   9.25   9.50
1908   6.40   6.25   7.50   7.40   7.40   8.40   8.25   7.90   7.85   7.65   8.00   8.00
1907   7.30   7.25   6.90   6.75   6.50   7.10   7.50   7.60   7.35   7.45   7.25   6.35
1906   6.50   6.40   6.35   6.35   6.20   6.10   6.50   6.85   6.95   7.30   7.40   7.90
1905   6.35   6.45   6.35   7.00   6.85   6.35   6.25   6.50   6.50   6.40   6.75   7.00
1904   5.90   6.00   5.80   5.80   5.90   6.70   6.65   6.40   6.55   7.00   7.30   7.65
1903   6.85   6.15   5.75   5.80   5.65   5.15   5.65   6.10   6.15   6.00   5.85   6.00
1902   7.75   7.35   7.40   7.50   7.70   8.50   8.85   9.00   8.85   8.75   7.40   7.75
1901   6.15   6.00   6.25   6.00   6.10   6.55   6.40   6.40   6.60   6.90   7.25   8.00
1900   6.60   6.10   6.05   6.00   5.85   5.90   5.85   6.20   6.15   6.00   6.00   7.50
1899   6.30   6.25   5.90   5.85   5.75   5.75   6.00   6.65   6.90   7.00   7.15   8.25
1898   5.50   5.85   5.80   5.50   5.50   5.35   5.65   5.75   5.85   5.90   6.25   6.25
1897   5.50   5.40   5.65   5.50   5.45   5.30   5.25   5.50   6.00   5.40   6.00   5.65
1896   4.00   4.75   4.75   4.75   4.55   4.65   4.60   5.00   5.30   5.30   5.45   6.50
1895   5.80   5.80   6.60   6.40   5.25   6.00   6.00   6.00   6.00   5.60   5.00   5.50

13. From the data of Exercise 12, construct the correlation table of 
hog prices and months of the year. 

14. From data obtained from a financial journal construct a correlation
table of the prices of common and preferred stocks.

15. In the correlation table of Exercise 12 does there appear to be
a sharp tendency for the beef cattle arrays to vary with the changing live
hog prices? Is the tendency more pronounced at some parts of the table
than at others?

16. Compare the tendencies for close connection between the at- 
tributes in the table of Exercise 13 with that in Exercise 12. 

Correlation. In the table of student heights and weights 
there is a decided tendency for heaviness and tallness to be 
associated and for lightness and shortness to be associated. 
There is likewise a pronounced tendency for the prices of 
live hogs and beef cattle to vary together. It is to be noted 
that the two series of measurements do not vary together in 
every case; that is, there are months in which the price of 
hogs is low but the price of beef high. But when all the 
months of an array are taken together the general tendency 
for the progressive increase of beef cattle prices with each 
increase of hog prices is evident. Two characteristics are said
to be correlated when there is a tendency for the changes in
the value of one to depend on the changes in the value of the
other. The two characteristics may increase together, or one may
increase while the other decreases; and even in one part of the table
the changes may move together while in another part they move in
opposition. The essential evidence for the presence of correlation
is that the measurements change from array to array.

In uncorrelated data there is no tendency for the distribu- 
tions of the arrays to change from type to type. 

In perfectly correlated data there is an exact connection
between the values of the two characteristics. If height and
weight were perfectly correlated, for instance, all persons of a
given height, say 68 inches, would be of the same weight and
hence all the frequencies of the weight array of type 68 would
lie within a single subgroup. Between the two extremes of perfect
and of no correlation there are all degrees of correlation.

Exercises. 

17. Study the degrees of correlation shown by the tables con- 
structed in working the exercises of this chapter. 

18. Is it possible to find actual data which shows absolutely no cor- 
relation? Construct an imaginary table which shows no correlation. 



CHAPTER VIII.
THE CORRELATION RATIO.

The Mean as Representative of the Array. In Chapter 
IV it was stated that the modal deviation is the most frequent 
deviation; that is, the most typical deviation of a distribution. 
Because the mode cannot be computed by a simple and uniform 
process of arithmetic the mean is a more practicable representa- 
tive of the array. And this substitution of the mean for the 
mode will rarely produce a serious error. 

Since the mean of the frequencies of an array is taken as 
the representative of the deviations of the array, from the defi- 
nition of correlation on page 73 it is apparent that the amount 
or degree of correlation in the data will be indicated by the varia- 
tion in the means from array to array. 

Regression Curves. The variation in the means of the 
arrays is shown graphically by the curve of means, which is 
called a regression* curve. 

Since there are two sets of arrays there are two regression 
curves. 

Coordinate Axes. It is usual to take for the horizontal
or x-axis the horizontal line thru the mean of all the y's; that is,
the horizontal line at a distance $\bar{y}$ below the base line of the
table, and for the y-axis the vertical line distant $\bar{x}$ from the left
marginal vertical. The point of intersection of these two lines
is called the center of the table. Deviations to the right are
taken positive and those to the left negative; deviations downward
from the new horizontal axis are positive and deviations
upward are considered negative. Sometimes this convention of
plus downward and negative upward is departed from. No confusion
can result however if it is remembered that the direction
in which an attribute is increasing is always taken as positive.



* So called by Francis Galton for certain reasons which arose in his 
investigations in biology. The name has become general.

Exercises.

1. Draw the axes and regression curves for each of the correlation 
tables of Chapter VII. 

2. Study and compare the forms of the regression curves of Ex- 
ercise 1. 

Correlation and the Regression Curves. In uncorrelated
data the mean of an array does not depend on the type of the
array; that is, does not change from array to array, and hence the
unchanging value of the means must be the same as the mean of
all the y's.

The regression curve for uncorrelated data therefore ap- 
proximates a straight line coinciding with the horizontal axis. 
For correlated data the regression curve diverges or deviates 
from this position of coincidence with the axis. It must be noted 
that the shape of the regression curve may be quite irregular 
without effect on the degree of correlation present in the data; 
it is the distance of the means from the axis that counts in de- 
termining the degree of correlation present. Hence any numeri- 
cal measure of the extent of correlation in the data must de- 
pend on the deviation of the means from the horizontal axis 
thru the center. 

Since there are two regression curves and two axes there 
are two correlations in each correlation table and their numerical 
measures involve the deviations of the respective regression 
curves from the corresponding straight lines thru the center. 
Thus the dependence of height on weight and of weight on 
height are two distinct correlations. 

Mean Squared Deviation of the Means of Arrays. The 
mean squared deviation is the most convenient measure of the 
deviation of the means of the arrays. In computing this the 
means of the arrays are first written in a vertical column and 
then the difference between each mean and the mean of all the 
variates is set down in a second column. Because the differences 
are used only in the squared form it is not necessary to retain a 
negative sign. 

The third column in the computations of Table X, page
77, contains the squares of the differences. Since the means
of the array are used as the representatives of the individuals
of the respective arrays each of these individuals is possessed of
the squared deviations. Hence each square must be multiplied
by the respective frequency of the corresponding array. The
resultant products form the fourth column. The sum of this
fourth column is the total sum of squared deviations and this
sum divided by the total frequency is the mean squared
deviation.
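
The column work just described amounts to a weighted mean of squares, which may be sketched as follows. The frequencies and means are hypothetical figures for a table of three arrays, and the function name is introduced here for illustration only.

```python
def mean_squared_deviation_of_means(frequencies, means, overall_mean):
    """Sum of n_x times the squared difference of each array mean from the
    mean of all the variates, divided by the total frequency."""
    total = sum(frequencies)
    products = [n * (m - overall_mean) ** 2 for n, m in zip(frequencies, means)]
    return sum(products) / total

# Hypothetical arrays: their frequencies and their means.
n_x = [5, 9, 9]
means = [1.2, 2.0, 8 / 3]

# The mean of all the variates is the frequency-weighted mean of the array means.
overall = sum(n * m for n, m in zip(n_x, means)) / sum(n_x)

msd = mean_squared_deviation_of_means(n_x, means, overall)
print(round(overall, 4), round(msd, 4))
```

Because every difference is squared, the sign of the difference plays no part, which is why the negative signs need not be retained in the hand computation.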

The Correlation Ratio. The mean squared deviation just
obtained would be a significant measure of correlation were it
not for the fact that it does not take into account the dispersion
of the data as a whole. Without changing the mean and
the frequency of a single y-array, it would be possible to
spread out each array to twice its length. This alteration
would increase the dispersion of the data as a whole but
would leave the mean square deviation from the horizontal
axis unchanged. It is evident that the value of the mean
square deviation of the means of the arrays is of less
significance in the more spread out data. Hence the dispersion
of the data as a whole must be considered in interpreting
the value of the mean squared deviation. The dispersion
of the data as a whole is given by the standard deviation of
the frequencies of the totals in the vertical sum column. The
smaller this standard deviation the more significant is the
deviation of the means, and the larger this standard deviation
the less significant the deviation of the means. It is therefore
reasonable to divide the square root of the mean square deviation
of the means of the arrays by the standard deviation from the
marginal column. The quotient is called the correlation ratio, and
is denoted by the Greek letter $\eta$.
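
In the symbols of Chapter VII the definition may be written out as below. The letter $N$ for the total frequency and $\sigma_y$ for the standard deviation of all the y's are notations assumed here for the purpose; the second form is the one that fits a computation carried thru with the squared standard deviation.

```latex
\eta = \frac{1}{\sigma_y}\sqrt{\frac{\sum_x n_x\,(\bar{y}_x - \bar{y})^2}{N}},
\qquad
\eta^2 = \frac{\sum_x n_x\,(\bar{y}_x - \bar{y})^2}{N\,\sigma_y^2}.
```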

The computation of the correlation ratio for the dependence 
of student weight on height follows. 

A carefully planned outline scheme of computation must 
be made before the figures are entered. 

The means and the one standard deviation were computed
in the usual manner. We have, for the data as a whole,
$\bar{y} = 7.9$, $\sigma^2 = 9.79$. The means of the arrays are written in
the second column just after the frequencies. The differences
between the means and $\bar{y}$ follow in the third column. The
squares, and the product of the squares by the frequencies are
the fourth and fifth columns respectively. The symbols explained
in Chapter VII are written at the head of each column.

Computation of $\eta$.

$n_x$    $\bar{y}_x$    $\bar{y} - \bar{y}_x$    $(\bar{y} - \bar{y}_x)^2$    $n_x(\bar{y} - \bar{y}_x)^2$
   2      9.5      1.6       2.56        5.12
  10      4.7      3.2      10.24      102.40
  11      4.4      3.5      12.25      134.75
  38      4.8      3.1       9.61      365.18
  57      6.1      1.8       3.24      184.68
  93      6.8      1.1       1.21      112.53
 106      6.9      1.0       1.00      106.00
 126      8.0      0.1       0.01        1.26
 109      8.8      0.9       0.81       89.19
  87      9.7      1.8       3.24      281.88
  75     10.9      3.0       9.00      675.00
  23     10.3      2.4       5.76      132.48
   9     11.1      3.2      10.24       92.16
   4     10.5      2.6       6.76       27.04
 750                                  2309.67

$$\eta^2 = \frac{2309.67}{750 \times 9.79} = 0.3146$$

$$\eta = 0.56$$

Table X.
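
The computation of Table X may be retraced with the sketch below, which takes the frequencies and the differences from the table and the squared standard deviation 9.79 from the text; the variable names are introduced here for illustration. Recomputing the products gives a sum of about 2308.77 rather than the printed 2309.67, since 109 × 0.81 is 88.29 where the table shows 89.19; the value of the correlation ratio to two places is not affected.

```python
from math import sqrt

# Frequencies of the weight arrays and the differences between their means
# and the mean of all the weights, as given in Table X.
frequencies = [2, 10, 11, 38, 57, 93, 106, 126, 109, 87, 75, 23, 9, 4]
differences = [1.6, 3.2, 3.5, 3.1, 1.8, 1.1, 1.0, 0.1, 0.9, 1.8, 3.0, 2.4, 3.2, 2.6]
variance_y = 9.79  # squared standard deviation of all the weights, from the text

total = sum(frequencies)
sum_of_products = sum(n * d ** 2 for n, d in zip(frequencies, differences))

eta_squared = sum_of_products / (total * variance_y)
eta = sqrt(eta_squared)

print(total, round(sum_of_products, 2), round(eta, 2))
```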








Exercises.

3. Compute the value of $\eta$ for the dependence of monthly precipitation
upon monthly mean temperature as shown by the data of the
Columbus Weather Station.

4. Compute the value of $\eta$ for the correlation of Chicago top hog
prices with Chicago top beef cattle prices as shown in the table of
Exercise 12 of the preceding chapter.

5. Compute values of $\eta$ from the tables of Exercises 13 and 14 of
Chapter VII.

Two Values for $\eta$ in Each Table. From the method of
computation it is clear that there are two values for $\eta$ in each
correlation table, one for each regression curve. The correlation
ratio of weight with height, for instance, may differ considerably
from the correlation ratio of height with weight; the dependence of
precipitation on temperature may be of a decidedly different degree
from that of temperature on precipitation. The two values of $\eta$ do
not ordinarily differ markedly but there can be no a priori assurance
that they will be essentially of equal value and hence it is
necessary to compute the two values separately in case both are
desired. To distinguish the two measures, for the dependence of y on
x, of weight on height, the symbol $\eta_y$ is used and the symbol $\eta_x$
refers to the dependence of x on y.
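
That the two values need not agree can be seen on a small example. The sketch below is offered under stated assumptions: the table is hypothetical, class numbers stand for the values of the attributes, and the functions `correlation_ratio` and `transpose` are names introduced here, not the author's.

```python
from math import sqrt

def correlation_ratio(table):
    """Correlation ratio of the row attribute (y) on the column attribute (x).

    table[i][j] is the frequency of the subclass with y class i + 1 and
    x class j + 1; class numbers are used as the values of y.
    """
    total = sum(sum(row) for row in table)
    y_values = range(1, len(table) + 1)
    n_y = [sum(row) for row in table]
    mean_y = sum(y * n for y, n in zip(y_values, n_y)) / total
    variance_y = sum(n * (y - mean_y) ** 2 for y, n in zip(y_values, n_y)) / total

    squared_deviations = 0.0
    for column in zip(*table):
        n_x = sum(column)
        if n_x == 0:
            continue  # an empty array contributes nothing
        mean_yx = sum(y * f for y, f in zip(y_values, column)) / n_x
        squared_deviations += n_x * (mean_yx - mean_y) ** 2

    return sqrt(squared_deviations / total / variance_y)

def transpose(table):
    """Interchange rows and columns, so that x takes the place of y."""
    return [list(column) for column in zip(*table)]

table = [[4, 2, 0], [1, 5, 3], [0, 2, 6]]   # hypothetical frequencies
eta_y = correlation_ratio(table)             # dependence of y on x
eta_x = correlation_ratio(transpose(table))  # dependence of x on y
print(round(eta_y, 3), round(eta_x, 3))
```

For this table the two ratios are close but not equal, which accords with the remark that each must be computed separately.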

Exercises. 

6. Compute the value of $\eta$ for the correlation of height with
weight and compare with the other value of $\eta$ computed on page 77.

7. Compute the value of $\eta_x$ from the precipitation-temperature
correlation table, and compare the values of $\eta_x$ and $\eta_y$.

8. Compute the value of $\eta_x$ for the live stock price table of
Exercise 12, Chapter VII, and compare with the value of $\eta_y$ from the
same table.

Limiting Values of the Correlation Ratio. In theory, the
means lie exactly on the axis for data of zero correlation.
Each separate item, therefore, in the mean square deviation
of the means is zero and hence $\eta$ is zero for zero correlation.

Because each term of the mean squared deviation of the
means is squared and hence necessarily positive any accidental
fluctuations of the means of the arrays in data of essentially
zero correlation increase the value of $\eta$. Since there are no
compensating fluctuations, the result is that small values of $\eta$ are
quite likely to be too large and hence the statistical significance
of $\eta$ for data of a small degree of correlation is open to question.
The degree of correlation in such cases cannot be greater
than the value of $\eta$ would indicate and it may be less. It must
be evident from the nature of the error that for material showing
a considerable degree of correlation the error from this
source is negligible.
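
The upward tendency of small values may be illustrated by an artificial experiment. In the sketch below the x class and the y value of each variate are drawn independently, so that the true correlation is zero; the sample size of 200, the ten classes, and the seed are arbitrary assumptions, and the function name is introduced here for illustration.

```python
import random
from math import sqrt

def correlation_ratio(pairs):
    """Correlation ratio of y on x from (x class, y value) pairs."""
    ys = [y for _, y in pairs]
    total = len(ys)
    mean_y = sum(ys) / total
    sum_sq_all = sum((y - mean_y) ** 2 for y in ys)

    arrays = {}
    for x, y in pairs:
        arrays.setdefault(x, []).append(y)

    # Squared deviations of the array means, weighted by array frequency.
    sum_sq_means = sum(
        len(values) * (sum(values) / len(values) - mean_y) ** 2
        for values in arrays.values()
    )
    return sqrt(sum_sq_means / sum_sq_all)

rng = random.Random(0)  # fixed seed so that the run can be repeated
pairs = [(rng.randrange(10), rng.gauss(0.0, 1.0)) for _ in range(200)]

eta = correlation_ratio(pairs)
print(round(eta, 2))
```

Although the attributes are unconnected by construction, the computed ratio comes out above zero, since the accidental fluctuations of the array means can only add to it.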

As discussed in Chapter VII, in perfectly correlated data 
the frequencies of each array are concentrated in a single class 
or subgroup of the array. According to the shape of the re- 
gression curve two cases can arise for perfectly correlated data, 
as the following imaginary distributions illustrate.

In the first distribution it is evident that the mean squared
deviation of the means is obtained by exactly the same numerical
work and by the use of the same numbers as is the standard
deviation of all the y's and hence $\eta$, which is the ratio of these
two measures, must be equal to unity. A study of the second
distribution leads to the conclusion that in this case also the two
measures of deviation are equal and hence $\eta$ is unity as it is in
the first distribution.
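
The conclusion may be verified numerically. The two tables below are hypothetical stand-ins, one with the frequencies along a straight line and one along a bent curve; they are assumptions made for this sketch and are not the author's own imaginary distributions. The function `correlation_ratio` is likewise a name introduced here.

```python
from math import sqrt

def correlation_ratio(table):
    """Correlation ratio of the row attribute (y) on the column attribute (x),
    with class numbers 1, 2, 3, ... used as the values of y."""
    total = sum(sum(row) for row in table)
    y_values = range(1, len(table) + 1)
    n_y = [sum(row) for row in table]
    mean_y = sum(y * n for y, n in zip(y_values, n_y)) / total
    variance_y = sum(n * (y - mean_y) ** 2 for y, n in zip(y_values, n_y)) / total

    squared_deviations = 0.0
    for column in zip(*table):
        n_x = sum(column)
        if n_x == 0:
            continue
        mean_yx = sum(y * f for y, f in zip(y_values, column)) / n_x
        squared_deviations += n_x * (mean_yx - mean_y) ** 2
    return sqrt(squared_deviations / total / variance_y)

# Every column (y array) has all its frequency in a single subclass.
straight = [[3, 0, 0],
            [0, 5, 0],
            [0, 0, 2]]
bent = [[3, 0, 2],
        [0, 5, 0]]

eta_straight = correlation_ratio(straight)
eta_bent = correlation_ratio(bent)
print(eta_straight, eta_bent)
```

In both tables the ratio is unity, whatever the shape of the curve thru the means, because each array is concentrated in one subclass.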

Exercises. 

9. Compare the values of $\eta$ that have been computed with the
general appearance of correlation in the tables.

10. Can a tendency be detected for the two values of $\eta$ to be
closer together in value for highly correlated data than for data of
smaller correlation?

11. Describe in general terms the two species of perfect correla- 
tion illustrated above. 

Probable Deviation of the Correlation Ratio. It can be
proved that the probable deviation of a correlation ratio is given
by the formula

$$\text{P. E.}_{\eta} = 0.6745\,\frac{1 - \eta^2}{\sqrt{N}}$$

in which $N$ is the total frequency.

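
As a numerical sketch, the probable deviation is computed below for the correlation ratio 0.56 and the total frequency 750 of the student data. The form assumed is the one commonly given, 0.6745 (1 − η²)/√N; the function name is introduced here for illustration.

```python
from math import sqrt

def probable_error_of_eta(eta, total_frequency):
    """Probable deviation of a correlation ratio, assuming the common form
    0.6745 * (1 - eta**2) / sqrt(N)."""
    return 0.6745 * (1 - eta ** 2) / sqrt(total_frequency)

pe = probable_error_of_eta(0.56, 750)
print(round(pe, 3))
```

A value of this size is small beside 0.56, so that on the assumptions of the formula the correlation ratio of the student data would be judged significant.
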
Exercises. 

12. Compute the probable deviations of the correlation ratios of 
this chapter. 

The probable error is of much practical use in estimating
the significance of values of $\eta$ tho of course the facts of its
theoretical derivation must always be kept in mind. It is
assumed in the derivation of the formula that the data is
strictly homogeneous thruout; that all the fluctuations from
sample to sample are merely those of random sampling; that
the regressions are truly linear; and that each array has the
same spread.

In working with correlations, especially where the total
frequencies are not large, it is always well to obtain a
considerable number of distributions. Then if there proves to be
a consistency in the value of $\eta$ greater confidence can be placed
in those values than if there were only one distribution. Thus
if fifty groups of 750 students were each measured for height
and weight and the computed values for $\eta$ should show a
decided tendency to agree in value, increased weight could be given
to these values of $\eta$.

Spurious Correlation. In interpreting the computed value 
of any measure of correlation care must be taken that the
correlation is not merely apparent and due to the method of
obtaining the data. Wages and general prices appear to be highly
correlated, but how much of this apparent connection is due to
the fact that both are expressed in terms of money which has
cheapened and consequently, at least to some extent, caused
both wages and general prices to increase? When money is
becoming cheaper both wages and prices tend in general to rise 
together; when money is becoming dearer both wages and prices 
tend in general to fall together. Such variations in the pur- 
chasing value of money could thus introduce a considerable 
element of apparent connection between the two attributes — 
wages and general prices — even when there was no tendency 
whatever for real wages to change, that is, no change in the 
amount of goods that the laborer could purchase with his wages. 

Exercises. 

13. In which of the correlations of this chapter is there a
possibility of spurious correlation?

14. Show that in correlating index numbers especial care is
necessary in interpreting values of $\eta$.

15. Show that where there is an element of spurious correlation 
present the correlation is real in so far as the measurements themselves 
are concerned. 



CHAPTER IX. 
THE COEFFICIENT OF CORRELATION. 

Linear Regression. A straight line fitted to the means 
of the arrays is called a line of regression. A line of regres- 
sion smooths the curve of regression. Whenever a curve of 
means approximates a straight line the regression is said to 
be sensibly linear. If the regression curve, within the limits 
of accuracy of the data, is exactly a straight line the regres- 
sion is said to be truly linear. 

The slope of a regression line shows the broad general 
tendencies of the connection between the attributes. Does 
weight tend to increase as height increases? Does the monthly 
precipitation increase with an increase of temperature? If so 
at about what rate? These are questions that depend for an 
answer on the slopes of the regression curves. It may happen 
that in some correlation tables the regression curves deviate 
in form so widely from a straight line that the line of regression 
has but little significance; in such cases the usefulness of any 
statement of general tendencies is open to question. 

Exercises. 

1. Draw by inspection the regression lines on the correlation table 
of student heights and weights. 

2. In Exercise 1 estimate the comparative degrees of correlation 
shown by the two regression lines. 

The Equations of the Lines of Regression. Let the 
coordinate axes be the two lines thru the center determined by 
the means of all the variates as described on page 74. Then 
$\bar{y}_x$ and $x$ are the coordinates of a point on the one regression 
line and $\bar{x}_y$ and $y$ on the other. It must be understood, however, 
that the values of $\bar{y}_x$ and $\bar{x}_y$ here referred to are the adjusted or 
fitted means of the arrays so that unless the regressions are truly 
linear these values will differ from the values obtained by actual 
computation. 

It is demonstrated in Chapter XII that the equation of the 
regression line of the means of the y arrays is 

$$\bar{y}_x = r\,\frac{\sigma_y}{\sigma_x}\,x.$$

And of the means of the x arrays, 

$$\bar{x}_y = r\,\frac{\sigma_x}{\sigma_y}\,y;$$

where $\sigma_y$ is the right hand marginal standard deviation and $\sigma_x$ 
the bottom marginal standard deviation. The constant $r$ is defined 
by the equation 

$$r = \frac{\sum\sum n_{xy}\,x\,y}{N\,\sigma_x\,\sigma_y}.$$

The expression $\sum\sum n_{xy}\,x\,y$ is a symbolic way of saying: the sum 
obtained by multiplying the frequency of each class by its deviation from 
the horizontal axis and then by its deviation from the vertical 
axis and then adding the products. 

According to the first of the two regression equations the 
mean weight for height 71 is obtained by substituting the value 
for $x$ measured from the mean and multiplying and dividing 
as the formula directs. We found that for this data of 
student measurements $\sigma_y = 2.36$ and $\sigma_x = 3.14$. The value of $r$ 
is found presently to be 0.56. The array is distant from the 
mean 3.1. Hence, 

$$\bar{y}_x = 0.56 \times \frac{2.36}{3.14} \times 3.1 = 1.31 \text{ weight classes from the mean weight.}$$

This use of the regression lines to estimate the position of the 
means is often of practical value. 
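The substitution may be checked by machine. The following is a minimal sketch; the function name `regression_mean` is an assumption made for the illustration, and the figures are those quoted above for the student data.

```python
def regression_mean(r, sigma_y, sigma_x, x_dev):
    """Fitted mean of the y array lying x_dev classes from the mean of x.

    Implements y = r * (sigma_y / sigma_x) * x, all deviations being
    measured from the axes thru the means.
    """
    return r * sigma_y / sigma_x * x_dev

# Figures quoted in the text for the student heights and weights.
estimate = regression_mean(r=0.56, sigma_y=2.36, sigma_x=3.14, x_dev=3.1)
print(round(estimate, 2))
```

The result is about 1.3 weight classes, agreeing with the figure in the text to within the rounding of the intermediate steps.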

The Coefficient of Correlation. Let us now compute the 
correlation ratio using, however, in case the regression is not 
truly linear, not the actual means of the array but the means 
given by the regression line. The deviation of a mean from 
the horizontal axis is $r\,\frac{\sigma_y}{\sigma_x}\,x$. The square of this quantity 
multiplied by the frequency of the array is 

$$n_x x^2 \cdot r^2\,\frac{\sigma_y^2}{\sigma_x^2}.$$

The last factor is the same for each array and the sum of the 
other factors leads to the standard deviation of all the x's. 
Hence we have, on carrying out the multiplications for each 
array and adding, 

$$r^2\,\frac{\sigma_y^2}{\sigma_x^2}\sum n_x x^2 = r^2\,\frac{\sigma_y^2}{\sigma_x^2}\,N\sigma_x^2 = N r^2 \sigma_y^2.$$

Therefore the mean squared deviation of the regression means of the 
arrays is $N r^2 \sigma_y^2 / N = r^2\sigma_y^2$. On dividing this mean squared 
deviation by the squared standard deviation of all the y's we have 

$$\frac{r^2\sigma_y^2}{\sigma_y^2} = r^2.$$

That is, the constant $r$ turns out to be the 
correlation ratio when the regression means are used instead 
of the true means. It is called the Coefficient of Correlation. 

Computation of r. For computation purposes the summation 
$\sum_x\sum_y n_{xy}\,x\,y$ can be arranged in the following manner. Let 
the subgroup frequencies of a given y array be each multiplied 
by the respective deviations, all deviations being measured from 
the axis thru the center, and the products summed. Divided 
by the frequency of the array this sum gives the mean $\bar{y}_x$. 
Hence the summation for the array is equal to the product of 
the mean $\bar{y}_x$ and the frequency $n_x$. On making this substitution 
the original summation formula becomes $\sum n_x\,\bar{y}_x\,x$, or 
$\sum n_x(\bar{y}_x - \bar{y})(x - \bar{x})$ from the original axes. 

In the course of the computation of the correlation ratio 
the means $\bar{y}_x$ are obtained and hence to the computation schedule 
of page 77 only the additional column for the x deviations 
of each array is needed. Then the multiplication of the corresponding 
values from the $n_x$, $(\bar{y}_x - \bar{y})$, and $(x - \bar{x})$ columns 
gives the column which sums into the quantity $\sum n_x(\bar{y}_x - \bar{y})(x - \bar{x})$. 
This sum divided by the product of the three factors $N$, $\sigma_x$, and 
$\sigma_y$ gives the required value for $r$. 
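That the double summation and the shortcut thru the array means lead to the same quantity can be verified on a small table. The sketch below is an illustration only: the frequencies are invented and do not come from any table of this book, and the variable names are assumptions.

```python
# Hypothetical correlation table: table[j][i] is the frequency of the class
# with y-mark ys[j] and x-mark xs[i]. The numbers are invented for illustration.
table = [
    [4, 2, 0],
    [3, 6, 3],
    [0, 2, 5],
]
xs = [0, 1, 2]
ys = [0, 1, 2]

N = sum(sum(row) for row in table)
n_x = [sum(row[i] for row in table) for i in range(len(xs))]  # array frequencies
n_y = [sum(row) for row in table]
x_bar = sum(n * x for n, x in zip(n_x, xs)) / N
y_bar = sum(n * y for n, y in zip(n_y, ys)) / N
sigma_x = (sum(n * (x - x_bar) ** 2 for n, x in zip(n_x, xs)) / N) ** 0.5
sigma_y = (sum(n * (y - y_bar) ** 2 for n, y in zip(n_y, ys)) / N) ** 0.5

# Double summation over every class of the table.
double_sum = sum(
    table[j][i] * (xs[i] - x_bar) * (ys[j] - y_bar)
    for j in range(len(ys))
    for i in range(len(xs))
)

# Shortcut: mean of each y array, then the sum of n_x (ybar_x - ybar)(x - xbar).
array_means = [
    sum(table[j][i] * ys[j] for j in range(len(ys))) / n_x[i]
    for i in range(len(xs))
]
shortcut = sum(
    n * (m - y_bar) * (x - x_bar) for n, m, x in zip(n_x, array_means, xs)
)

r = shortcut / (N * sigma_x * sigma_y)
print(round(double_sum, 6), round(shortcut, 6), round(r, 3))
```

The two sums agree, so only the array means and the x deviations are needed once the correlation ratio has been worked out.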

Table XI which follows shows the computation. The 
value of $\sigma_y$ is found in Exercise 10, Chapter V, to be 2.31, and 
$\sigma_x$ is 3.14, as before. 

Computation of r. 

  x    $\bar{y}_x$   $\bar{y}_x - \bar{y}$    $n_x$   $x - \bar{x}$   $n_x(x - \bar{x})$   $n_x(\bar{y}_x - \bar{y})(x - \bar{x})$
  1     6.3      -1.6        2     -6.9       -13.8         22.08
  2     4.7      -3.2       10     -5.9       -59.0        188.80
  3     4.4      -3.5       11     -4.9       -53.9        188.65
  4     4.8      -3.1       38     -3.9      -148.2        459.42
  5     6.1      -1.8       57     -2.9      -165.3        297.54
  6     6.8      -1.1       93     -1.9      -176.7        194.37
  7     6.9      -1.0      106     -0.9       -95.4         95.40
  8     8.0      +0.1      126     +0.1       +12.6          1.26
  9     8.8      +0.9      109     +1.1      +119.9        107.91
 10     9.7      +1.8       87     +2.1      +182.7        328.86
 11    10.9      +3.0       75     +3.1      +232.5        697.50
 12    10.3      +2.4       23     +4.1       +94.3        226.32
 13    11.1      +3.2        9     +5.1       +45.9        146.88
 14    10.5      +2.6        4     +6.1       +24.4         63.44

$$\sum n_x(\bar{y}_x - \bar{y})(x - \bar{x}) = 3018.42$$

$$r = \frac{3018.42}{750 \times 3.14 \times 2.31} = 0.55$$

Table XI. 



Exercises. 

3. Compute the value of r for the monthly precipitation and tem- 
perature data. 

4. Compute r for the top-hog and top-beef-cattle data. 

5. Compare the values of r in Exercises 3 and 4 and in the weight-height 
data with the corresponding values for $\eta$. 

6. Compute the value of r from the monthly price-of-hogs data 
of Exercise 12, Chapter VII. Compare with the corresponding value 
for $\eta$. 

7. Does there seem to be a tendency for $\eta$ and r to agree more 
closely for highly correlated data than for material of small correlation? 

8. Compare the amount of labor involved in the computation of r 
with that involved in the computation of $\eta$. 

The Relation of r to $\eta$. In data exhibiting a regression 
which is truly linear the value of $\eta$ is, of course, identical with 
that of r. In the case of any but truly linear regression it is 
readily shown in Chapter XII that the value of r is necessarily 
less than that of $\eta$. In fact if the regression curve is of a 
certain shape the value of r will be very small even tho practically 
perfect correlation exists. 
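A curve of this kind is easily exhibited. In the sketch below the data are invented: $y$ is taken exactly equal to $x^2$ on values of $x$ placed symmetrically about zero, so that the correlation is perfect but the regression is far from linear. The function names are assumptions made for the illustration.

```python
from math import sqrt

def coefficient_r(xs, ys):
    """Coefficient of correlation from ungrouped pairs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def correlation_ratio(xs, ys):
    """Correlation ratio of y on x: spread of the array means over spread of all y."""
    n = len(ys)
    y_bar = sum(ys) / n
    arrays = {}
    for x, y in zip(xs, ys):
        arrays.setdefault(x, []).append(y)   # one array of y values for each x
    between = sum(len(a) * (sum(a) / len(a) - y_bar) ** 2 for a in arrays.values())
    total = sum((y - y_bar) ** 2 for y in ys)
    return sqrt(between / total)

xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x * x for x in xs]          # y is completely determined by x

r = coefficient_r(xs, ys)
eta = correlation_ratio(xs, ys)
print(r, eta)
```

Here the coefficient r vanishes while the correlation ratio is unity.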

Unlike the correlation ratio the coefficient of correlation 
expresses a property of the correlation table as a whole and not 
merely of one or the other of the two correlations of the table. 

Again, unlike the correlation ratio, the negative sign ob- 
tained in the final extraction of the square root (in the dis- 
cussion of page 76) has a significance; it indicates that the 
regression line has a negative slope and hence that the con- 
nection between the attributes is inverse; that is, one attribute 
increases while the other decreases. 

Because both positive and negative values of r can occur 
there is no tendency, as there is in the case of $\eta$, for small values 
of r to be larger than the actual degree of correlation would 
warrant. 

Limiting Values for r. In data of zero correlation it is 
clear that the regression line coincides with the axis and hence 
the value of r must be zero. 

Reasoning from the relation of r to $\eta$ we see that for truly 
linear regression perfect correlation leads to a value of r equal 
to unity. The unity value for r will be positive or negative 
according as the correlation is direct or inverse. According to 
the underlying theory of the coefficient of correlation for data in 
which a regression is not linear the value of r cannot be unity 
even tho there is perfect correlation and hence r is necessarily 
smaller in value than the degree of correlation would require. 

Statistical Properties of the Coefficient of Correlation. 
The coefficient r is, as the preceding discussions show, a con- 
servative measure of correlation. In periodic data exhibiting a 
sinusoid form for the regression curve the correlation may be 
high but because the departure of the regression from linearity 
is so wide the value of r understates the correlation and hence 
its applicability in such data is not of importance. 

The characteristic importance of the coefficient r is in defining 
the slope of the regression lines. It furnishes the most 
convenient method for defining the general tendencies in the 
data. The rise of prices, for instance, during the last fifteen 
years can be readily measured by the rate of rise of the regression 
line. 

Therefore for the single purpose of measuring correlation 
the coefficient of correlation is distinctly inferior to the corre- 
lation ratio both in convenience and reliability. It should never 
be used as a measure of correlation without first carefully test- 
ing the form of the regression. It does have however the highly 
useful property of giving the slopes of the regression lines. 

Test for Linearity of Regression. It would be suspected 
from the preceding theory and discussion that the difference 
between $\eta$ and r should be an indicator of the departure of the 
regression from linearity. A somewhat more convenient measure 
of this departure than the simple difference is the difference 
of the squares of $\eta$ and r. 

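Probable Deviations. The following probable deviations 
can be derived. 

$$\text{P. E. of } r = 0.6745\,\frac{1 - r^2}{\sqrt{N}}$$

$$\text{P. E. of } (\eta^2 - r^2) = \frac{1.35}{\sqrt{N}}\,\sqrt{\eta^2 - r^2}$$

A practical criterion of linearity is to assume linearity when 

$$\frac{\sqrt{N}}{1.35}\,\sqrt{\eta^2 - r^2} < 2.5.$$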
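The two computations can be sketched as follows, taking the probable deviation of r as $0.6745(1 - r^2)/\sqrt{N}$ and the criterion as $\frac{\sqrt{N}}{1.35}\sqrt{\eta^2 - r^2} < 2.5$. The function names are assumptions, and the value of $\eta$ used in the trial is hypothetical, chosen only to show the test in operation.

```python
from math import sqrt

def probable_error_r(r, n):
    """Probable deviation of the coefficient r for a total frequency n."""
    return 0.6745 * (1 - r * r) / sqrt(n)

def linearity_statistic(eta, r, n):
    """sqrt(N)/1.35 * sqrt(eta^2 - r^2); linearity is assumed when below 2.5."""
    return sqrt(n) / 1.35 * sqrt(eta * eta - r * r)

# r and N as for the student data; eta = 0.60 is a hypothetical value.
pe = probable_error_r(r=0.55, n=750)
stat = linearity_statistic(eta=0.60, r=0.55, n=750)
regression_is_linear = stat < 2.5
print(round(pe, 4), round(stat, 2), regression_is_linear)
```

With these trial figures the statistic exceeds 2.5, so linearity would not be assumed.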

9. Compute the regression equations for each of the correlation 
tables of Chapter VII. 

10. How can the value of r be obtained graphically from the re- 
gression lines? Is this a practicable method of finding the value of r? 

11. Compute the measure of departure from linearity, $(\eta^2 - r^2)$, for 
the correlation tables of Chapter VII. 

12. A correlation table has two measures of departure from 
linearity. Show that one regression may be linear and the other non- 
linear. 

13. Show that if the value of r is high the regressions must both 
be approximately linear. 

14. By extending the regression line estimate the price of live hogs 
for January, 1917. 

15. What weight should correspond to a student height of 50 
inches? 

16. What is the best estimate of the temperature for a month 
with a precipitation of 3.4 inches? 

17. Discuss the value of the probable deviations in exercises 14-16. 



CHAPTER X. 
CORRELATION FROM RANKS. 

Rank in a Series. When the data consists not of the 
direct measurements of the characteristics but of their order or 
rank in a series the correlation of the ranks may differ mate- 
rially from the true variate correlation. Let us define rank as 
position in a series so that an individual of rank one would 
have no individuals above or before it; an individual of rank 
two would have one individual before it, etc. 

To pass from rank to variate correlation it is necessary to 
know the form of the distribution of the values of the characteristics. 
Only for normal distributions has the requisite theory 
been developed. It is consequently necessary to employ the 
same formulas for other forms of distributions, although this 
may sometimes open the way to serious inaccuracies. 

Let the ranks of the same individual in regard to the respective 
characteristics be $v_x$ and $v_y$. Let there be $N$ individuals 
and let $\bar{v}_x$ and $\bar{v}_y$ denote the respective means of the two series 
and $\sigma_{v_x}$ and $\sigma_{v_y}$ the standard deviations. 

Also let all the measurements of each characteristic be distinct 
in value; that is, let there be no equal measurements. 

Theorem I. The mean ranks $\bar{v}_x$ and $\bar{v}_y$ are each equal 
to $(N + 1)/2$. 

Since there are as many ranks as individual measurements 
and since the ranks proceed uniformly from 1 to $N$ the mean 
is $(N + 1)/2$. 

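Theorem II. The standard deviations of the ranks are 
each equal to $\sqrt{(N^2 - 1)/12}$. 

For, 

$$\begin{aligned}
N\sigma_{v_x}^2 &= \sum (v_x - \bar{v}_x)^2 \\
&= \sum v_x^2 - 2\bar{v}_x \sum v_x + N\bar{v}_x^2 \\
&= \tfrac{1}{6}N(N+1)(2N+1) - 2\bar{v}_x \cdot \tfrac{1}{2}N(N+1) + N\bar{v}_x^2,
\end{aligned}$$

on applying the rules $\sum v_x^2 = \tfrac{1}{6}N(N+1)(2N+1)$ and 
$\sum v_x = \tfrac{1}{2}N(N+1)$, 

$$\begin{aligned}
N\sigma_{v_x}^2 &= \tfrac{1}{6}N(N+1)(2N+1) - \frac{N(N+1)^2}{2} + \frac{N(N+1)^2}{4} \\
&= \tfrac{1}{12}N(N^2 - 1).
\end{aligned}$$

Therefore $\sigma_{v_x}^2 = \tfrac{1}{12}(N^2 - 1)$. 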
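Theorems I and II can be checked by direct computation for any number of individuals. The sketch below assumes untied ranks 1 to N; the function name is an assumption made for the illustration.

```python
def rank_mean_and_variance(n):
    """Mean and squared standard deviation of the untied ranks 1, 2, ..., n."""
    ranks = list(range(1, n + 1))
    mean = sum(ranks) / n
    variance = sum((v - mean) ** 2 for v in ranks) / n
    return mean, variance

for n in (5, 24, 101):
    mean, variance = rank_mean_and_variance(n)
    # Theorem I: the mean rank is (N + 1)/2.
    assert mean == (n + 1) / 2
    # Theorem II: the squared standard deviation is (N^2 - 1)/12.
    assert abs(variance - (n * n - 1) / 12) < 1e-9

print(rank_mean_and_variance(24))
```

For the twenty-four ranks the mean is 12.5 and the squared standard deviation is 575/12.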

The following theorem is necessary for the computation 
of rank correlation. 

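Theorem III. If $\sigma_x = \sigma_y$, $r = 1 - \dfrac{\sigma_{(x-y)}^2}{2\sigma_x^2}$. 

For, $(x - y)^2 = x^2 + y^2 - 2xy$, 

and $\sum (x - y)^2 = \sum x^2 + \sum y^2 - 2\sum xy$, 

or $N\sigma_{(x-y)}^2 = N\sigma_x^2 + N\sigma_y^2 - 2\sum xy$. 

But $\sum xy = r \cdot N \cdot \sigma_x\sigma_y$. 

Therefore $N\sigma_{(x-y)}^2 = N\sigma_x^2 + N\sigma_y^2 - 2Nr\sigma_x\sigma_y$, 

and $r = \dfrac{\sigma_x^2 + \sigma_y^2 - \sigma_{(x-y)}^2}{2\sigma_x\sigma_y}$. 

If $\sigma_x = \sigma_y$, $r = 1 - \dfrac{\sigma_{(x-y)}^2}{2\sigma_x^2}$. 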



Exercises. 

1. Show how to compute the value of r from the data of Table 
VIII by the formula 

$$r = \frac{\sigma_x^2 + \sigma_y^2 - \sigma_{(x-y)}^2}{2\sigma_x\sigma_y}.$$



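Theorem IV. The correlation coefficient of the ranks $v_x$ 
and $v_y$ is given by the formula, 

$$r_{v_x v_y} = 1 - \frac{6\sum (v_x - v_y)^2}{N(N^2 - 1)}.$$

On making use of theorem 3, we have, 

$$\begin{aligned}
r_{v_x v_y} &= 1 - \frac{\sigma_{(v_x - v_y)}^2}{2\sigma_{v_x}^2} \\
&= 1 - \frac{\sum (v_x - v_y)^2}{2N\sigma_{v_x}^2} \\
&= 1 - \frac{\sum (v_x - v_y)^2}{2N \cdot \tfrac{1}{12}(N^2 - 1)}, \text{ from theorem 2,} \\
&= 1 - \frac{6\sum (v_x - v_y)^2}{N(N^2 - 1)}.
\end{aligned}$$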
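The rank formula can be compared with the ordinary product-moment coefficient computed on the ranks themselves. The sketch below is illustrative only: the second ranking is invented, ties are assumed absent, and the function names are assumptions.

```python
def rank_correlation(vx, vy):
    """1 - 6*sum(d^2) / (N(N^2 - 1)) for two untied rank series."""
    n = len(vx)
    d2 = sum((a - b) ** 2 for a, b in zip(vx, vy))
    return 1 - 6 * d2 / (n * (n * n - 1))

def product_moment(xs, ys):
    """Ordinary coefficient of correlation, for comparison."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

vx = [1, 2, 3, 4, 5, 6, 7]
vy = [2, 1, 4, 3, 7, 5, 6]     # hypothetical second ranking of the same individuals

by_formula = rank_correlation(vx, vy)
by_moments = product_moment(vx, vy)
same_order = rank_correlation(vx, vx)           # identical ranks
reversed_order = rank_correlation(vx, vx[::-1])  # ranks exactly inverted
print(round(by_formula, 4), round(by_moments, 4), same_order, reversed_order)
```

The two computations agree for untied ranks, and the limiting cases of identical and of inverted order give +1 and -1.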


To illustrate the method let us compute the rank correla- 
tion between yearly mean temperature and yearly mean rain- 
fall for Ohio as shown by the data of Exercises 2 and 3 of 
Chapter II. The yearly means are obtained by adding the 
monthly means and dividing by twelve. 

The order of the twenty-four years in respect to temperature 
is written in the first column and in respect to rainfall 
in the second. The ties are disposed of by assigning the ranks 
in the inverse order of the time; thus 1903 and 1902 are tied at 50.5, 
and 1903 is given rank 15 and 1902, 16. But the matter of 
ties will be presently considered. The third column contains 
for each year the differences in rank with respect to the two 
attributes, temperature and rainfall, and the fourth the squared 
differences. On adding the fourth column and applying the 
formula 

$$r = 1 - \frac{6\sum (v_x - v_y)^2}{N(N^2 - 1)}$$

we find r = 0.10. 

Year.   Temp.   Rainfall.   Difference.   Sq. Diff.
1911      1        4            3             9
1910     17        6           11           121
1909     13        5            8            64
1908      5       19           14           196
1907     22        3           19           361
1906      9       14            5            25
1905     20       10           10           100
1904     24       17            7            49
1903     15       16            1             1
1902     16       13            3             9
1901     18       24            6            36
1900      2       21           19           361
1899     11       18            7            49
1898      6        2            4            16
1897     14       11            3             9
1896      7        9            2             4
1895     21       23            2             4
1894      3       22           19           361
1893     10        7            3             9
1892     19       14            5            25
1891      8       12            4            16
1890      4        1            3             9
1889     11       20            9            81
1888     23        8           15           225

$N = 24$ 
$N(N^2 - 1) = 13,800$ 
$\sum (v_x - v_y)^2 = 2,150$ 
$6\sum (v_x - v_y)^2 = 12,900$ 

$$r = 1 - \frac{6\sum (v_x - v_y)^2}{N(N^2 - 1)} = 0.10.$$

Ties in Rank. The application of the formula 

$$r_{v_x v_y} = 1 - \frac{6\sum (v_x - v_y)^2}{N(N^2 - 1)}$$

is straightforward and direct. The only uncertainty 
arises from ties in the measurements. Thus in the preceding 
illustrative example it is found that the temperature for 
each of the two years 1900 and 1894 is 52.3. What ranks are to 
be assigned to each of the measurements? In order to avoid 
complicating details in an illustrative problem, in the preceding 
computation we gave the later year the numerically smaller rank, 
but ordinarily it is better to base the ranks on one of the 
two plans: 

(1). The Bracket Rank method, under which the ties 
are assigned the same rank and that equal rank is taken as the 
rank next greater than that of the individual preceding the ties. 
The next individual after the ties takes the same rank as if 
preceding ties had each been given ranks differing by unity. 
Thus under this method the ranks of the illustrative example 
are as given in the table below. 

(2). The Mid-Rank method, under which all ties are 
given the same rank but that rank is the rank of the mid-individual. 
In the column below the two methods may be 
compared. 

Under either method the total number of ranks must be 
the same and equal to $N$. 







Year.   Temperature.   Bracket Method.   Mid-Rank Method.
1911       52.6              1                 1
1900       52.3              2                 3
1894       52.3              2                 3
1890       52.3              2                 3
1908       52.1              5                 5.5
1898       52.1              5                 5.5
1896       51.7              7                 7.5
1891       51.7              7                 7.5
1906       51.6              9                 9.5
1893       51.6              9                 9.5
1899       51.5             11                11
1889       51.1             12                12
1909       50.9             13                13
1897       50.6             14                14
1903       50.5             15                15.5
1902       50.5             15                15.5
1910       50.4             17                17
1901       50.2             18                18
1892       50.1             19                19
1905       50.0             20                20
1895       49.9             21                21
1907       49.6             22                22
1888       49.5             23                23
1904       48.6             24                24
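The two plans for ties can be sketched in a few lines. The code below is an illustration, not a prescribed procedure: the function names are assumptions, the highest measurement is given rank 1, and the trial data are the twelve highest temperatures of the table.

```python
def bracket_ranks(values):
    """Ties share the rank next after the individual preceding them."""
    ordered = sorted(values, reverse=True)        # highest value takes rank 1
    return [ordered.index(v) + 1 for v in values]

def mid_ranks(values):
    """Ties share the rank of the mid-individual of the tied group."""
    ordered = sorted(values, reverse=True)
    result = []
    for v in values:
        first = ordered.index(v) + 1              # rank of the first of the ties
        count = ordered.count(v)                  # number of individuals tied
        result.append(first + (count - 1) / 2)
    return result

# The twelve highest yearly temperatures of the illustrative example.
temps = [52.6, 52.3, 52.3, 52.3, 52.1, 52.1, 51.7, 51.7, 51.6, 51.6, 51.5, 51.1]

print(bracket_ranks(temps))
print(mid_ranks(temps))
```

Under either plan the individual following a group of ties takes the rank it would have had if the ties had been given ranks differing by unity.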



2. Compute $r_{v_x v_y}$ from the above "bracket method" ranks. 

3. Compute $r_{v_x v_y}$ from the above "mid-rank method" ranks. 



92 INTRODUCTION TO MATHEMATICAL STATISTICS 

Probable Deviation of the Rank Coefficient. As given by 
Pearson; 

$$\text{P. E. of } r_{v_x v_y} = \frac{0.6745}{\sqrt{N}}\,(1 - r_{v_x v_y}^2).$$

Perfect Rank Correlation. The ranks are perfectly correlated, 
according to the formula, when $\sum (v_x - v_y)^2 = 0$; that is, 
when each individual has the same rank in both series. Also 
there is perfect negative correlation when temperature and rainfall 
are inversely related so that the year with the highest temperature 
is the year with the lowest rainfall and so on up to 
the year with the lowest temperature which is associated with 
the highest rainfall. In this case of perfect negative correlation 
when $N$ is odd, 

$$\sum (v_x - v_y)^2 = 8\left[0 + 1 + 4 + 9 + \dots + \left(\frac{N-1}{2}\right)^2\right] = 8 \cdot \frac{1}{6} \cdot \frac{N-1}{2} \cdot \frac{N+1}{2} \cdot N.$$

Therefore 

$$r = 1 - \frac{2N(N^2 - 1)}{N(N^2 - 1)} = -1,$$

a result according with the usual idea of inversely 
correlated attributes. 

Uncorrelated Data. According to the formula, the sum of 
the squares of the differences of the ranks is equal to the sum of 
the squares of the ranks when r = o. Thus when r = o sub- 
tracting the ranks has lost its significance — and this is exactly 
the idea of zero correlation. 

Hence the rank coefficient r is accurately significant for 
both perfect and zero correlation. 

A Correction Formula for the Rank Coefficient. There 
is no assurance however that in general the rank r will exactly 
express the true variate correlation. For instance, note the two 
following series of deviations, 

100, 80, 70, 65, 62, 60, 55, 50, 40, 20; and 100, 99, 98, 97, 96, 
95, 10, 9, 8, 7. 

The ranks are the same in each series, namely, 
1, 2, 3, 4, 5, 6, 7, 8, 9, 10. 

The coefficient $r_{v_x v_y}$, which depends solely on the ranks, has the 
same value for a series of which the first is typical as it does 
for a series of which the second is typical. And yet the two 
distributions are fundamentally distinct in form. 

Therefore, except for the two extreme cases of material of 
very high and of very low correlation, the value of a correlation 
constant computed from ranks must be interpreted with caution. 

For a distribution which is approximately normal in form 
the following correction formula for $r_{v_x v_y}$ has been derived by 
Pearson.* 

$$r_{xy} = 2 \sin\left(\frac{\pi}{6}\, r_{v_x v_y}\right).$$

From Table X the values of $r_{xy}$ can be obtained directly 
from the value of $r_{v_x v_y}$ for each 0.05 of $r_{v_x v_y}$. 

Corresponding Values of $r_{xy}$ and $r_{v_x v_y}$. 

$r_{v_x v_y}$    $r_{xy}$      $r_{v_x v_y}$    $r_{xy}$
  0.00       0.00         0.55       0.57
  0.05       0.06         0.60       0.62
  0.10       0.10         0.65       0.67
  0.15       0.16         0.70       0.72
  0.20       0.20         0.75       0.77
  0.25       0.26         0.80       0.81
  0.30       0.31         0.85       0.86
  0.35       0.36         0.90       0.91
  0.40       0.42         0.95       0.96
  0.45       0.47         1.00       1.00
  0.50       0.52         ....       ....

Table X. 
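Values of this kind can be computed directly, taking the correction in the form $r_{xy} = 2\sin(\frac{\pi}{6} r_{v_x v_y})$. The sketch below assumes that form; the function name is an assumption made for the illustration, and a printed table may differ from the computed figures in the last place.

```python
from math import pi, sin

def variate_r_from_rank_r(rank_r):
    """Correction of the rank coefficient: 2 sin(pi/6 * rank_r)."""
    return 2 * sin(pi / 6 * rank_r)

# Tabulate at intervals of 0.05, as in the table.
for step in range(0, 21):
    rank_r = step * 0.05
    print(f"{rank_r:4.2f}  {variate_r_from_rank_r(rank_r):6.4f}")
```

The correction leaves the values 0 and 1 unchanged and raises every intermediate value slightly, the greatest increase being less than 0.02.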





For other values of $r_{v_x v_y}$ the corresponding values of $r_{xy}$ 
are readily obtained by interpolation. 

Probable Deviation of $r_{xy}$ Computed from Ranks. As 
given by Pearson* 

$$\text{P. E. of } r_{xy} \text{ from ranks} = \frac{0.7063}{\sqrt{N}}\,(1 - r^2).$$

Exercises. 

4. Determine $r_{xy}$ from the value of $r_{v_x v_y}$ of Exercise 1. 

5. Compute the value of the rank r from the data of other ex- 
ercises and compare with the computed values of the variate r. 

Theorem V. Multiplying each rank $v_x$ or $v_y$ by a constant 
does not change the value of r. 

For if each rank is multiplied by $m$, the means $\bar{v}_x$ and $\bar{v}_y$ 
are each multiplied by $m$ so that the squares of the standard deviations 
are each multiplied by $m^2$. Also $\sum (v_x - v_y)^2$ is multiplied by $m^2$ and hence 
r is unchanged. 

Theorem VI. Multiplying the ranks of one column but not 
of the other in general produces a change in the value of $r_{v_x v_y}$. 

In this case the formula 

$$r = 1 - \frac{6\sum (v_x - v_y)^2}{N(N^2 - 1)}$$

cannot be used and the formula 

$$r = \frac{N\sigma_{v_x}^2 + N\sigma_{v_y}^2 - \sum \left[(v_x - \bar{v}_x) - (v_y - \bar{v}_y)\right]^2}{2N\sigma_{v_x}\sigma_{v_y}}$$

must be employed. 

The Accuracy of the Coefficient r^y when computed from 
Ranks. When the measurements are arranged in ranks and 
the coefficient is computed from the ranks alone, the computation 
is based on the relatively limited information which the ranks 
can convey. Hence the resulting coefficient can not be as trustworthy 
and reliable as the moment coefficient. However, when 
a detailed correlation table cannot be constructed owing to a 
paucity of information, it may still be possible to determine the 
rank of the individual. If proper allowance is made for the 
necessarily wide inaccuracy of the computed result, the rank coefficient 
is better than no coefficient at all for such inaccurate 
or indeterminate data. 



CHAPTER XI. 

THE MOMENTS OF A DISTRIBUTION. 

Introduction. The first moment, obtained by multiplying 
each deviation by the corresponding frequency, adding the result- 
ing products and dividing by the total frequency of the distribu- 
tion, was discussed in Chapter IV in connection with the 
arithmetic mean. The second moment, in which the deviations 
are squared before multiplication by the frequencies, was dis- 
cussed in Chapter V. The third and fourth moments, with the 
deviations cubed and raised to the fourth power respectively, 
were also referred to in Chapter V. 

Obviously the moments may be computed about any point 
by obtaining the deviations from that point and raising to the 
appropriate power, etc. For most purposes, however, the sec- 
ond and higher moments are computed about the mean which 
thus serves as a standard origin for the moments. 

The moments about the mean are denoted by the symbols 
$\mu_1$, $\mu_2$, $\mu_3$, $\mu_4$, etc. where the subscripts refer to the order of the 
moments; that is, the index of the power to which the deviations 
are raised. Under the same system of notation, the 
moments about any other point are denoted by $\mu_1'$, $\mu_2'$, $\mu_3'$, $\mu_4'$, 
etc., with the primes serving to distinguish moments about the 
mean from moments about any other origin. 

The moments about the mean may be computed directly by 
first computing the mean and then subtracting the value of the 
mean from each deviation and using the resulting differences in 
the computations for the moments. This method of computing 
the moments has the advantages of simplicity and directness but 
it usually leads to troublesome fractions and it ordinarily in- 
volves more labor than the indirect methods which are described 
in this chapter. 

Transformation Formulas for the Moments about the 
Mean. The formulas for the moments about the mean in 
terms of the moments about a fixed point will now be derived. 
Let $d$ be the mean deviation, that is, the distance of the mean 
from the fixed point of reference, and let the $x$'s be measured 
from the mean. Then corresponding to a given value of $x$ there 
will be the deviation $x' = x + d$ about the fixed point. 
From the definition of a moment we have, 

$$\mu_1' = \frac{1}{N}\sum (x + d)\,y = \frac{1}{N}\sum xy + \frac{Nd}{N} = \mu_1 + d = d,$$

since $\sum xy$ is zero (Theorem, page 35, Chapter IV); 

$$\mu_2' = \frac{1}{N}\sum (x + d)^2 y = \frac{1}{N}\sum x^2 y + \frac{2d}{N}\sum xy + \frac{Nd^2}{N} = \mu_2 + d^2,$$

since $\sum xy = 0$; 

$$\mu_3' = \frac{1}{N}\sum (x + d)^3 y = \frac{1}{N}\sum x^3 y + \frac{3d}{N}\sum x^2 y + \frac{3d^2}{N}\sum xy + \frac{Nd^3}{N} = \mu_3 + 3d\mu_2 + d^3;$$

$$\mu_4' = \frac{1}{N}\sum (x + d)^4 y = \frac{1}{N}\sum x^4 y + \frac{4d}{N}\sum x^3 y + \frac{6d^2}{N}\sum x^2 y + \frac{4d^3}{N}\sum xy + \frac{Nd^4}{N} = \mu_4 + 4d\mu_3 + 6d^2\mu_2 + d^4.$$

Transposing a part of the terms in the four preceding equations 
and changing the signs, we have the following equations 
which express each moment about the mean in terms of the corresponding 
moment about the fixed point and the moments of 
lower order about the mean: 

$$\begin{aligned}
\mu_1 &= \mu_1' - d = 0, \text{ since } \mu_1' = d; \\
\mu_2 &= \mu_2' - d^2; \\
\mu_3 &= \mu_3' - 3d\mu_2 - d^3; \\
\mu_4 &= \mu_4' - 4d\mu_3 - 6d^2\mu_2 - d^4.
\end{aligned}$$

These formulas for transferring the moments from a fixed 
point to the mean are arranged in what is called the continuous 
form; that is, they begin with the moment of lowest order and 
proceed step by step up to the fourth moment. 
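The transfer can be tried on a small distribution. The sketch below is an illustration under stated assumptions: the frequencies and the fixed point are invented, the function name is hypothetical, and the relations used are $\mu_2 = \mu_2' - d^2$, $\mu_3 = \mu_3' - 3d\mu_2 - d^3$ and $\mu_4 = \mu_4' - 4d\mu_3 - 6d^2\mu_2 - d^4$.

```python
def moments_about(origin, xs, freqs, orders=(1, 2, 3, 4)):
    """Moments of the listed orders about the given origin."""
    n = sum(freqs)
    return [
        sum(f * (x - origin) ** k for x, f in zip(xs, freqs)) / n
        for k in orders
    ]

xs = [1, 2, 3, 4, 5]
freqs = [2, 5, 9, 3, 1]     # hypothetical frequencies
origin = 2                  # hypothetical fixed point of reference

m1p, m2p, m3p, m4p = moments_about(origin, xs, freqs)

# Continuous form: each moment uses those of lower order already found.
d = m1p
mu2 = m2p - d ** 2
mu3 = m3p - 3 * d * mu2 - d ** 3
mu4 = m4p - 4 * d * mu3 - 6 * d ** 2 * mu2 - d ** 4

# Direct computation about the mean, for comparison.
mean = origin + d
direct = moments_about(mean, xs, freqs)
print([round(v, 6) for v in (mu2, mu3, mu4)])
print([round(v, 6) for v in direct[1:]])
```

The transferred moments agree with those computed directly about the mean.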
Exercises. 

1. Compute the third and fourth moments for the student height 
data at the beginning of Chapter III. 

2. By taking the fixed point of reference at various points show 
that for the data of Student Heights the third and fourth moments are 
least when computed about the arithmetic mean. 

3. Show algebraically that any moment is least when taken about 
the mean. 

4. Show that having the formulas arranged in the continuous form 
does not involve additional labor of computation. 

5. By expanding $\mu_n = \frac{1}{N}\sum (x' - d)^n$ according to the binomial 
theorem where $x'$ is measured from the fixed point of reference, derive 
the general formula expressing $\mu_n$ in terms of the moments about the 
fixed point. 

6. Specialize the formula of (5) for the third and fourth moments. 

7. Compute the third and fourth moments from the data of Table I 
by using the formulas of exercise 6. 

8. Find the first, second, third and fourth moments about the 
mean of a distribution with frequencies proportional to the successive 
terms of the expansion of the binomial $(p + q)^n$. 

Ans. $\mu_3 = npq\,(p - q)$; $\mu_4 = npq\,(3(n - 2)pq + 1)$. (See Hardy, 
"Construction of Tables of Mortality," p. 107 et seq.). 

The computation of the moments about the mean either 
directly or by first computing about a convenient origin and 
then transforming to the mean is open to the serious practical 
objection that there are no convenient methods of checking the 
results. The arithmetic of the following summation methods is 
comparatively brief and admits of satisfactory checks on the cor- 
rectness of the results. 

The First Summation Method of Computing the Mo- 
ments. The theory of this and the second summation method 
which follows immediately after it are somewhat detailed but 
both are entirely elementary throughout. 

Let us take a distribution with the five frequencies $y_1, y_2, y_3, 
y_4, y_5$ corresponding to values of $x$ equal to 1, 2, 3, 4, 5. By the 
ordinary direct method, the first moment about the point 
$x = 0$ is $y_1 + 2y_2 + 3y_3 + 4y_4 + 5y_5$. Now let us arrange the 
y's in vertical order and add in the manner indicated in the second 
column below. 

(1)     (2)                          (3)
y1      y1 + y2 + y3 + y4 + y5       y1 + 2y2 + 3y3 + 4y4 + 5y5
y2      y2 + y3 + y4 + y5            y2 + 2y3 + 3y4 + 4y5
y3      y3 + y4 + y5                 y3 + 2y4 + 3y5
y4      y4 + y5                      y4 + 2y5
y5      y5                           y5
Sum:
Σy      y1 + 2y2 + 3y3 + 4y4 + 5y5   y1 + 3y2 + 6y3 + 10y4 + 15y5

(4)                                  (5)
y1 + 3y2 + 6y3 + 10y4 + 15y5         y1 + 4y2 + 10y3 + 20y4 + 35y5
y2 + 3y3 + 6y4 + 10y5                y2 + 4y3 + 10y4 + 20y5
y3 + 3y4 + 6y5                       y3 + 4y4 + 10y5
y4 + 3y5                             y4 + 4y5
y5                                   y5
Sum:
y1 + 4y2 + 10y3 + 20y4 + 35y5        y1 + 5y2 + 15y3 + 35y4 + 70y5

The sum of the second column is thus the same as the first 
moment. By the direct method the second moment about the 
same point is $y_1 + 4y_2 + 9y_3 + 16y_4 + 25y_5$ divided by $N$. Let 
us designate the sum of column (1), when divided by $N$, by 
$S_1$; the second divided by $N$, by $S_2$; the third when divided by 
$N$, by $S_3$, etc. That is 

$$S_2 = \frac{y_1 + 2y_2 + 3y_3 + \dots}{N}, \qquad S_3 = \frac{y_1 + 3y_2 + 6y_3 + \dots}{N},$$

$$S_4 = \frac{y_1 + 4y_2 + 10y_3 + \dots}{N}, \qquad S_5 = \frac{y_1 + 5y_2 + 15y_3 + \dots}{N}.$$

It is apparent on inspection that $2S_3 - S_2$ is the second moment. 
In symbols, 

$$\frac{2}{N}(y_1 + 3y_2 + 6y_3 + 10y_4 + 15y_5) - \frac{1}{N}(y_1 + 2y_2 + 3y_3 + 4y_4 + 5y_5) = \frac{1}{N}(y_1 + 4y_2 + 9y_3 + 16y_4 + 25y_5).$$

That is, $\mu_2' = 2S_3 - S_2$. 

The third moment about the same point of reference is 

$$\frac{1}{N}(y_1 + 8y_2 + 27y_3 + 64y_4 + 125y_5).$$

For this moment the following relation is readily verified: 

$$\mu_3' = 6S_4 - 6S_3 + S_2.$$

Extending the reasoning to the case of the fourth moment, 
we have 

$$\mu_4' = 24S_5 - 36S_4 + 14S_3 - S_2.$$

We thus have four relations connecting the moments with 
the S's: 

$$\begin{aligned}
\mu_1' &= S_2, \\
\mu_2' &= 2S_3 - S_2, \\
\mu_3' &= 6S_4 - 6S_3 + S_2, \\
\mu_4' &= 24S_5 - 36S_4 + 14S_3 - S_2.
\end{aligned}$$

Transferred to the mean as origin by the formulas of page 
88 these moments become 

$$\begin{aligned}
S_2 &= d; \\
\mu_2 &= \mu_2' - d^2 = 2S_3 - S_2 - d^2 = 2S_3 - d(1 + d); \\
\mu_3 &= \mu_3' - 3d\mu_2 - d^3 = 6S_4 - 6S_3 + S_2 - 3d\mu_2 - d^3 \\
&= 6S_4 - 3\mu_2 - 3d(1 + d) + d - 3d\mu_2 - d^3 \\
&= 6S_4 - 3\mu_2(1 + d) - d(1 + d)(2 + d);
\end{aligned}$$

and similarly, 

$$\mu_4 = 24S_5 - 2\mu_3\{2(1 + d) + 1\} - \mu_2\{6(1 + d)(2 + d) - 1\} - d(1 + d)(2 + d)(3 + d).$$

It is evident that the same relations hold for a larger num- 
ber of classes than the five which we have assumed for the pur- 
pose of illustrating the method. 
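The first summation method lends itself to a short routine. The sketch below is an illustration only: the frequencies are invented, the function name is an assumption, and the formulas applied are $d = S_2$, $\mu_2 = 2S_3 - d(1+d)$, $\mu_3 = 6S_4 - 3\mu_2(1+d) - d(1+d)(2+d)$ and $\mu_4 = 24S_5 - 2\mu_3\{2(1+d)+1\} - \mu_2\{6(1+d)(2+d)-1\} - d(1+d)(2+d)(3+d)$.

```python
def summation_S(freqs, how_many=4):
    """Return [S2, S3, ...]: each column total of the scheme divided by N."""
    n = sum(freqs)
    column = list(freqs)                    # column (1)
    out = []
    for _ in range(how_many):
        running, nxt = 0, []
        for value in reversed(column):      # add from the bottom upward
            running += value
            nxt.append(running)
        column = nxt[::-1]                  # next column, top entry first
        out.append(sum(column) / n)
    return out

freqs = [3, 8, 14, 9, 2]                    # hypothetical y1..y5 at x = 1..5
N = sum(freqs)
S2, S3, S4, S5 = summation_S(freqs)

d = S2                                      # distance of the mean from x = 0
mu2 = 2 * S3 - d * (1 + d)
mu3 = 6 * S4 - 3 * mu2 * (1 + d) - d * (1 + d) * (2 + d)
mu4 = (24 * S5 - 2 * mu3 * (2 * (1 + d) + 1)
       - mu2 * (6 * (1 + d) * (2 + d) - 1)
       - d * (1 + d) * (2 + d) * (3 + d))

def direct(k):
    """Moment of order k about the mean by the ordinary direct method."""
    return sum(f * (x - d) ** k for x, f in zip(range(1, 6), freqs)) / N

print([round(v, 6) for v in (mu2, mu3, mu4)])
print([round(direct(k), 6) for k in (2, 3, 4)])
```

The moments found by summation agree with those found by the direct method.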

These relations connecting the moments about the mean 
with the sums obtained by this process of summation are ma- 
terially shorter and more convenient than the direct formulas. 
It will be noticed that the sum of any column is the largest 



lOO INTRODUCTION TO MATHEMATICAL STATISTICS 

number in the next column, so that a satisfactory check on the 
summation is afforded. It is possible, however, by taking the 
point of reference near the mean, to still further shorten the labor 
of the computation. 

The Second Summation Method of Computing the Moments. 
To illustrate the second method let us take a distribution 
of eight classes and assume the fixed point of reference 
at class 5. Then we sum from both top and bottom to, but not 
including, the frequencies of class 5 in accordance with the fol- 
lowing scheme. 



 (1)    (2)                      (3)                          (4)                   (5)                    (6)

 y_1    y_1                      y_1                          y_1                   y_1                    y_1
 y_2    y_2 + y_1                y_2 + 2y_1                   y_2 + 3y_1            y_2 + 4y_1
 y_3    y_3 + y_2 + y_1          y_3 + 2y_2 + 3y_1            y_3 + 3y_2 + 6y_1
 y_4    y_4 + y_3 + y_2 + y_1    y_4 + 2y_3 + 3y_2 + 4y_1

 y_5

 y_6    y_6 + y_7 + y_8          y_6 + 2y_7 + 3y_8            y_6 + 3y_7 + 6y_8     y_6 + 4y_7 + 10y_8     y_6 + 5y_7 + 15y_8
 y_7    y_7 + y_8                y_7 + 2y_8                   y_7 + 3y_8            y_7 + 4y_8             y_7 + 5y_8
 y_8    y_8                      y_8                          y_8                   y_8                    y_8

Forming $\mu'_1$, about the point $x = 5$, by the direct method
we have

$$\mu'_1 = -\frac{1}{N}(y_4 + 2y_3 + 3y_2 + 4y_1) + \frac{1}{N}(y_6 + 2y_7 + 3y_8).$$

But $\mu'_1$ has been defined as equal to $S_2$. Hence $S_2$ is obtained
by subtracting the last upper summation term from the last
lower term in column (3).



By direct computation,

$$\mu'_2 = \frac{1}{N}(y_4 + 4y_3 + 9y_2 + 16y_1 + y_6 + 4y_7 + 9y_8).$$

But $\mu'_2 = 2S_3 - S_2$, or $2S_3 = \mu'_2 + S_2$.

Hence

$$\begin{aligned}
2S_3 &= \frac{1}{N}(y_4 + 4y_3 + 9y_2 + 16y_1 + y_6 + 4y_7 + 9y_8) - \frac{1}{N}(y_4 + 2y_3 + 3y_2 + 4y_1 - y_6 - 2y_7 - 3y_8) \\
     &= \frac{1}{N}(2y_3 + 6y_2 + 12y_1 + 2y_6 + 6y_7 + 12y_8).
\end{aligned}$$

Therefore,

$$S_3 = \frac{1}{N}(y_3 + 3y_2 + 6y_1) + \frac{1}{N}(y_6 + 3y_7 + 6y_8).$$

That is, $S_3$ is the sum of the last term in the positive, or lower,
summation and the last but one (the last term as written in
the scheme) of the negative summation terms in column (4).

Likewise,

$$S_4 = \frac{1}{N}(y_6 + 4y_7 + 10y_8) - \frac{1}{N}(y_2 + 4y_1),$$

the difference between the last positive summation term and the
last but two of the negative summation terms in column (5).

And

$$S_5 = \frac{1}{N}(y_6 + 5y_7 + 15y_8) + \frac{1}{N}y_1,$$

the sum of the last positive summation term and the last but three
of the negative terms in column (6).

After the S's are obtained the formulas of page 91 are applied
to obtain the $\mu$'s.

As in the first summation method the partial summations
can be added for checks on the arithmetic.

This second summation method will be found very convenient,
especially when the number of classes is large or the
frequencies are of considerable size.

The following computations for the data of page 24 illustrate
the two summation methods.

Computations of this length should never be attempted 
without first arranging a complete form with a place for each 
number and that place so chosen that the number is in its most 
convenient location. The entire computation should be planned 
before the arithmetic is begun. 



Class    Freq.     S_2      S_3       S_4        S_5

   1        2      750     5925     28463     105421
   2       10      748     5175     22538      76958
   3       11      738     4427     17363      54420
   4       38      727     3689     12936      37057
   5       57      689     2962      9247      24121
   6       93      632     2273      6285      14874
   7      106      539     1641      4012       8589
   8      126      433     1102      2371       4577
   9      109      307      669      1269       2206
  10       87      198      362       600        937
  11       75      111      164       238        337
  12       23       36       53        74         99
  13        9       13       17        21         25
  14        4        4        4         4          4

Totals    750     5925    28463    105421     329625



$S_2 = 7.9$     $S_3 = 37.95$     $S_4 = 140.56$     $S_5 = 439.5$

(1). $d = S_2 = 7.9$
(2). $d(1 + d) = 70.31$
(3). $d(1 + d)(2 + d) = 696.069$
(4). $3(1 + d) = 26.7$
(5). $4(1 + d) + 2 = 37.6$
(6). $6(1 + d)(2 + d) - 1 = 527.66$

$\mu_2 = 2S_3 - d(1 + d) = 5.592$
$\sigma = \sqrt{\mu_2} = 2.36$
$\mu_3 = 6S_4 - \mu_2 \cdot (4) - (3) = -2.015$
$\mu_4 = 24S_5 - \mu_3 \cdot (5) - \mu_2 \cdot (6) - (3 + d) \cdot (3) = 85.937$


Notice that the sum of the first column = 750, the last
sum at the top of the $S_2$ column, and similarly for each following
column. The computation form is arranged for the use of a
"Millionaire" calculating machine.
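The first summation method can be checked mechanically. The following is a minimal sketch in Python, not part of the original computation form: the function names are illustrative assumptions, the classes are taken as numbered 1, 2, 3, ... from the top, and the frequencies are those of the table above. It builds each column as running totals from the last class upward, forms the S's, applies the relations for the moments about the mean, and compares the result with a direct computation.

```python
from itertools import accumulate


def summation_sums(freqs, depth=4):
    """Return [S_2, ..., S_(depth+1)] by repeated summation from the last class upward."""
    total = sum(freqs)
    column = list(freqs)
    sums = []
    for _ in range(depth):
        # Each new column holds running totals of the previous one, built from the bottom.
        column = list(accumulate(reversed(column)))[::-1]
        sums.append(sum(column) / total)
    return sums


def moments_from_sums(s2, s3, s4, s5):
    """Moments about the mean from the S's, using the relations given in the text."""
    d = s2
    mu2 = 2 * s3 - d * (1 + d)
    mu3 = 6 * s4 - 3 * mu2 * (1 + d) - d * (1 + d) * (2 + d)
    mu4 = (24 * s5 - 2 * mu3 * (2 * (1 + d) + 1)
           - mu2 * (6 * (1 + d) * (2 + d) - 1)
           - d * (1 + d) * (2 + d) * (3 + d))
    return d, mu2, mu3, mu4


def direct_moments(freqs):
    """Second, third and fourth moments about the mean, computed directly."""
    total = sum(freqs)
    classes = range(1, len(freqs) + 1)
    mean = sum(x * f for x, f in zip(classes, freqs)) / total
    return [sum((x - mean) ** k * f for x, f in zip(classes, freqs)) / total
            for k in (2, 3, 4)]


freqs = [2, 10, 11, 38, 57, 93, 106, 126, 109, 87, 75, 23, 9, 4]
d, mu2, mu3, mu4 = moments_from_sums(*summation_sums(freqs))
print("by summation:", d, mu2, mu3, mu4)
print("directly:    ", direct_moments(freqs))
```

The two printed sets of moments should agree to within rounding of the machine arithmetic; small differences from the hand computation arise because the form above rounds the S's before they are combined.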
The second summation method is started as follows:

Class    Freq.     (2)      (3)      (4)      (5)      (6)

   1        2        2        2        2        2        2
   2       10       12       14       16       18       20
   3       11       23       37       53       71       91
   4       38       61       98      151      222
   5       57      118      216      367
   6       93      211      427

   7      106

   8      126      433     1102     2371     4577     8185
   9      109      307      669     1269     2206     3608
  10       87      198      362      600      937     1402
  11       75      111      164      238      337      465
  12       23       36       53       74       99      128
  13        9       13       17       21       25       29
  14        4        4        4        4        4        4

          750              1102     2371     4577     8185

$S_2 = (1102 - 427)/750 = 0.9$
$S_3 = (2371 + 367)/750 = 3.65$
$S_4 = (4577 - 222)/750 = 5.81$
$S_5 = (8185 + 91)/750 = 11.03$

The computation from this point on is the same as under the 
first method, except that the origin is at class 7, or the height 
class 67, instead of height class 60. 
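The second summation method can be sketched in the same way. The code below is an illustration under stated assumptions, not the original computation form: the function names are hypothetical, classes are numbered from 1, and `origin` is the class taken as the fixed point of reference. The upper part is summed downward from class 1 and the lower part upward from the last class; each S combines the last lower term of a column with the upper term that lies one class further from the origin for each successive column.

```python
from itertools import accumulate


def second_method_sums(freqs, origin):
    """Return [S_2, S_3, S_4, S_5] about class `origin` (classes numbered from 1)."""
    total = sum(freqs)
    upper = list(accumulate(freqs[:origin - 1]))    # column (2), summed downward from class 1
    lower = list(accumulate(freqs[origin:][::-1]))  # column (2), summed upward from the last class
    sums = []
    for m in range(2, 6):
        upper = list(accumulate(upper))             # next column of the upper summation
        lower = list(accumulate(lower))             # next column of the lower summation
        positive = lower[-1] if lower else 0        # last lower term, next to the origin
        idx = origin - m                            # upper term m - 1 classes from the origin
        negative = upper[idx] if idx >= 0 else 0
        sign = -1 if m % 2 == 0 else 1              # differences for S_2, S_4; sums for S_3, S_5
        sums.append((positive + sign * negative) / total)
    return sums


def central_moments(freqs, origin):
    """Moments about the mean from the S's taken about the chosen origin."""
    s2, s3, s4, s5 = second_method_sums(freqs, origin)
    d = s2
    mu2 = 2 * s3 - d * (1 + d)
    mu3 = 6 * s4 - 3 * mu2 * (1 + d) - d * (1 + d) * (2 + d)
    mu4 = (24 * s5 - 2 * mu3 * (2 * (1 + d) + 1)
           - mu2 * (6 * (1 + d) * (2 + d) - 1)
           - d * (1 + d) * (2 + d) * (3 + d))
    return d, mu2, mu3, mu4


freqs = [2, 10, 11, 38, 57, 93, 106, 126, 109, 87, 75, 23, 9, 4]
print(second_method_sums(freqs, origin=7))
print(central_moments(freqs, origin=7))
```

Since the moments about the mean do not depend on the point of reference, running `central_moments` with two different origins gives a convenient check on the whole computation.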

Exercises. 

9. Compute the moments for the frequency distribution of page 
29 by the two summation methods. 

10. Demonstrate the proof of the two summation methods for n 
classes. 

11. What difference would result in the computations of the second
summation method if the origin were taken at the eighth class instead of
the seventh so that the upper sum in the first summation is the larger?

Correction Formulas for the Moments. All the methods that
have been proposed for finding the moments assume that the frequencies
are concentrated at the center of each class while actually the deviations
are continuously distributed from one end of the range to the other so
that there is nothing in the nature of the data to correspond to the
classes, mid-ordinates, etc. A certain degree of error is therefore
introduced by these methods. We are not really working with the actual
deviations but with the artificial classes built up from the actual
deviations. In how far then are facts, which hold for the classes, of
significance for the actual variates? It may well be that in ordinary
statistical work the closeness of the measurements may not warrant taking
these errors into account but the corrections are easily applied and
frequently make a significant difference in the results. However the
corrections should not be applied to data not accurate enough to warrant
such care, no matter how easily the corrections are applied. The methods
adopted in computation must never be such as to presuppose more accurate
data than that in hand.

When the distinction is made between the moments as calculated
from the class frequencies and deviations and the moments calculated
under the assumption of continuous variation, it is customary to denote
the values as computed by $\nu_1, \nu_2, \nu_3, \nu_4$, and $\nu'_1, \nu'_2, \nu'_3, \nu'_4$, and
reserve the corresponding $\mu$'s for the values under the assumption of
continuity. When no account is taken of the distinction between the
discrete and continuous series of frequencies, the $\mu$'s alone are used.
The $\nu$'s are often spoken of as the raw or unadjusted moments and
the $\mu$'s as the adjusted moments.

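The adjustment or correction formulas are:

$$\begin{aligned}
\mu_1 &= \nu_1 = 0, \\
\mu_2 &= \nu_2 - \tfrac{1}{12}, \\
\mu_3 &= \nu_3, \\
\mu_4 &= \nu_4 - \tfrac{1}{2}\nu_2 + \tfrac{7}{240}.
\end{aligned}$$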

The theory of these corrections is due to Dr. Sheppard 
and to Professor Pearson. A simple demonstration of the 
formulas is that of Bio. iii, p. 308. 

According to the underlying mathematical theory these cor- 
rection formulas hold in strictness only for a frequency curve 
with high contact at each end. When these conditions are not 
satisfied it is probably best not to apply the corrections. 
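As an illustration of how the corrections are applied, the following sketch adjusts a set of raw moments. The function name and the `width` parameter are assumptions made for the example; the text states the formulas with the class interval as the unit, and the general-width form used here (with $h^2/12$, $h^2\nu_2/2$ and $7h^4/240$) is the usual statement of Sheppard's corrections. The raw moments fed in are the values computed earlier by the first summation method.

```python
def sheppard_adjust(nu2, nu3, nu4, width=1.0):
    """Adjusted moments about the mean from raw ones, for classes of the given width."""
    h2 = width ** 2
    mu2 = nu2 - h2 / 12
    mu3 = nu3                      # the third moment needs no correction
    mu4 = nu4 - h2 * nu2 / 2 + 7 * h2 ** 2 / 240
    return mu2, mu3, mu4


# Raw moments from the worked example, in class-interval units.
print(sheppard_adjust(5.592, -2.015, 85.937))
```

The corrections are only meaningful under the condition stated above, a frequency curve with high contact at each end; the sketch does not test for that condition.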

Theorem I. Changing the unit of measurement of the
deviation; that is, multiplying each deviation by a constant,
multiplies a moment by that constant raised to a power equal to
the order of the moment. For,

$$\mu_n = \frac{1}{N}\sum x^n y,$$

and $\sum (rx)^n y = r^n \sum x^n y$.

Theorem II. Multiplying or dividing each frequency by a
constant does not change the moments. For,

$$\frac{\sum x^n r y}{\sum r y} = \frac{r\sum x^n y}{r\sum y} = \frac{\sum x^n y}{\sum y}.$$

Because the values of the third and fourth moments depend
on the unit of measure of the deviations it is usual to employ
these two moments in the forms $\beta_1$ and $\beta_2$, respectively, where
$\beta_1 = \mu_3^2/\mu_2^3$ and $\beta_2 = \mu_4/\mu_2^2$. To show that $\beta_1$ and $\beta_2$ are
independent of the unit of measure of $x$ let us write

$$\beta_1 = \frac{N(\sum x^3 y)^2}{(\sum x^2 y)^3} \quad\text{and}\quad \beta_2 = \frac{N(\sum x^4 y)}{(\sum x^2 y)^2}.$$

Then let $x$ be changed into $rx$ where $r$ is any constant.
This gives

$$\beta_1 = \frac{N(\sum x^3 y)^2\, r^6}{(\sum x^2 y)^3\, r^6} = \frac{N(\sum x^3 y)^2}{(\sum x^2 y)^3},$$

and similarly for $\beta_2$.
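Theorems I and II and the invariance of $\beta_1$ and $\beta_2$ can be checked numerically. The sketch below is illustrative only: the function name, the small distribution and the factor 2.54 are assumptions chosen for the example.

```python
def beta_coefficients(deviations, freqs):
    """Return (beta_1, beta_2) for a frequency distribution."""
    total = sum(freqs)
    mean = sum(x * y for x, y in zip(deviations, freqs)) / total

    def mu(k):
        # k-th moment about the mean
        return sum((x - mean) ** k * y for x, y in zip(deviations, freqs)) / total

    return mu(3) ** 2 / mu(2) ** 3, mu(4) / mu(2) ** 2


xs = [1, 2, 3, 4, 5]
ys = [1, 4, 6, 3, 1]
base = beta_coefficients(xs, ys)
rescaled = beta_coefficients([2.54 * x for x in xs], ys)   # change of unit of x
reweighted = beta_coefficients(xs, [3 * y for y in ys])    # each frequency multiplied by 3
print(base, rescaled, reweighted)
```

All three pairs should agree apart from rounding, since the factor $r^6$ (or $r^4$) cancels between numerator and denominator and a common factor in the frequencies cancels in every moment.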

Exercises. 

12. Show that adding a constant to each deviation changes the 
moments. 

13. Show that adding a constant to each frequency changes the 
moments. 

Summations. The following exercises are intended for 
practice in using summations and should be carefully worked 
through in order that a comprehension of the somewhat de- 
tailed discussions of subsequent chapters is not hindered by a 
lack of familiarity with the necessary algebra. 

Exercises. 

14. Show that the square of $\sum xy$ is $\sum x^2y^2 + 2\sum\sum x_s y_s x_t y_t$ where
the subscripts are attached in the second summation to indicate the
product of unequal deviations, and all deviations are measured from the mean.

15. By actually computing the separate value of each summation
verify the relation $(\sum xy)^2 = \sum x^2y^2 + 2\sum\sum x_s y_s x_t y_t$ for the distribution
1, 2, 5, 2, 1.

16. Establish the relation $(\sum xy)^3 = \sum x^3y^3 + 3\sum\sum x_s^2 y_s^2 x_t y_t$.

17. Establish the relation $(\sum xy)^4 = \sum x^4y^4 + 4\sum\sum x_s^3 y_s^3 x_t y_t +
6\sum\sum x_s^2 y_s^2 x_t^2 y_t^2$.

18. Show that $(\sum xy)(\sum x^2y) = \sum x^3y^2 + \sum\sum x_s y_s x_t^2 y_t$.

19. Prove that $x_s^4 + x_t^4 > 2x_s^2x_t^2$ and hence $\sum (x_s^4 + x_t^4) > 2\sum x_s^2x_t^2$.

20. Show that $(\sum x^4y)(\sum y) > (\sum x^2y)^2$.

We have $\sum x^4y = x_1^4y_1 + x_2^4y_2 + x_3^4y_3 + \dots$,

and $(\sum y)(\sum x^4y) = x_1^4y_1^2 + x_2^4y_2^2 + x_3^4y_3^2 + \dots + x_1^4y_1y_2 +
x_1^4y_1y_3 + \dots + x_2^4y_2y_1 + x_2^4y_2y_3 + \dots$

$= \sum x^4y^2 + \sum x_s^4 y_s y_t = \sum x^4y^2 + \sum y_s y_t (x_s^4 + x_t^4)$.

Also $(\sum x^2y)^2 = \sum x^4y^2 + 2\sum x_s^2 y_s x_t^2 y_t = \sum x^4y^2 + 2\sum y_s y_t x_s^2 x_t^2$.

Therefore $(\sum x^4y)(\sum y) > (\sum x^2y)^2$,

if $\sum x^4y^2 + \sum y_s y_t (x_s^4 + x_t^4) > \sum x^4y^2 + 2\sum y_s y_t x_s^2 x_t^2$,

i. e. if $\sum y_s y_t (x_s^4 + x_t^4) > 2\sum y_s y_t x_s^2 x_t^2$.

But the sum of the squares of two quantities is always greater than
twice their product and hence each term on the left is greater than the
corresponding term on the right, thus proving the theorem. The algebraic
discussion may be more easily followed if a summation of only two or
three terms is first employed.

21. Prove that $(\sum x^2y)(\sum x^4y) > (\sum x^3y)^2$.

22. Prove that $\beta_2 > \beta_1$.

The Moments and the Equation of the Smoothed Curve. 
It is shown in Chapter II that a smooth curve is fitted on the 
basis of principles which are assumed true for the data as a 
whole. One such principle is that of equality of area which 
assumes that the area under the curve is equal in numerical 
value to the total frequency of the distribution. 

The principle of equality of moments assumes in addition
to the equality of area and total frequency that the first, second,
third and fourth moments computed directly from the data are
respectively equal to the first, second, third and fourth moments
computed from the adjusted frequencies.

To illustrate the application of the method of equality of
moments let us fit a straight line to the points (2,4), (3,3),
(4,7), (5,6).

The equation of the required line is $y = mx + b$ where $m$
and $b$ are to be determined. The adjusted y's in terms of $m$
and $b$ are $2m + b$, $3m + b$, $4m + b$, $5m + b$.

The equality of the area and the total frequency can
be expressed as an equality of moments if the moment of
zero order is permitted. This is possible because any number
with an exponent zero is equal to unity. Hence

$$2^0 \cdot 4 + 3^0 \cdot 3 + 4^0 \cdot 7 + 5^0 \cdot 6 = 4 + 3 + 7 + 6 = 20.$$

Also,

$$2^0(2m + b) + 3^0(3m + b) + 4^0(4m + b) + 5^0(5m + b) = 14m + 4b.$$

Hence, on equating the two zero moments,

$$14m + 4b = 20.$$

From the first moment,

$$2 \cdot 4 + 3 \cdot 3 + 4 \cdot 7 + 5 \cdot 6 = 2(2m + b) + 3(3m + b) + 4(4m + b) + 5(5m + b),$$

we have

$$54m + 14b = 75.$$

Solving these two moment equations simultaneously we have

$$m = 1, \qquad b = 1.5.$$

Therefore $y = x + 1.5$ is the required equation of the
straight line fitted to the given points on the basis of the
assumption of equality of the zero and first moments respectively.
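The two moment equations of this illustration can be set up and solved for any set of points. The following is a small sketch, assuming integer coordinates so that exact fractions can be used; the function name is hypothetical and the solution is by Cramer's rule rather than by any method prescribed in the text.

```python
from fractions import Fraction


def fit_line_by_moments(points):
    """Fit y = m*x + b by equating the zero and first moments of observed and adjusted y's."""
    n = len(points)
    sum_x = sum(x for x, _ in points)
    sum_xx = sum(x * x for x, _ in points)
    sum_y = sum(y for _, y in points)
    sum_xy = sum(x * y for x, y in points)
    # Zero moment:   sum_x  * m + n     * b = sum_y
    # First moment:  sum_xx * m + sum_x * b = sum_xy
    det = sum_x * sum_x - n * sum_xx
    m = Fraction(sum_y * sum_x - n * sum_xy, det)
    b = Fraction(sum_x * sum_xy - sum_xx * sum_y, det)
    return m, b


points = [(2, 4), (3, 3), (4, 7), (5, 6)]
m, b = fit_line_by_moments(points)
print(m, b)
# Check: both moments of the adjusted y's reproduce those of the data.
print(sum(m * x + b for x, _ in points), sum(x * (m * x + b) for x, _ in points))
```

The final line prints the zero and first moments of the adjusted ordinates, which should equal the sums 20 and 75 formed from the observed ordinates.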

Exercises. 

23. Fit a straight line to the preceding illustrative points on the 
assumption of equality of the first and second moments respectively. 
Should the resulting equation agree exactly with the equation found 
above ? 

24. Fit a straight line to the points (1,6), (3,8), (4,6), (5,5), (7,10). 

25. Fit a parabola, $y = a + bx + cx^2$, to the points of
Exercise 24.



CHAPTER XII. 

FURTHER THEORY OF CORRELATION. 

A Second Concept of Correlation. In Chapter VII two 
attributes are said to be correlated when there is a tendency for 
a change in the value of one to be followed by a change in the 
value of the other. And the ratio of the standard deviation of 
the means of the arrays to the standard deviation of all the 
variates was taken as the measure of the degree of correlation 
between the attributes. A second approach to the matter of cor- 
related variates is as follows. 

On the assumption that the mean is the representative of
the variates of an array the dependence of y on x is exhibited
by the curve of means; that is, by the regression curve. Obviously
this curve is a significant measure of the dependence of
y on x only insofar as the means are in fact representatives of
the variates of the respective arrays. Within this limitation
the spread of the variates about the means of the successive
arrays is a measure of the extent of dependence of y on x; that
is, of the correlation of y with x.

Let $\sigma_{ay}^2$ denote the mean squared divergence from the
regression curve. Then

$$N\sigma_{ay}^2 = \sum\sum n_{xy}(y - \bar y_x)^2.$$

As is explained in Chapters VIII and IX, this mean squared
deviation must be divided by $\sigma_y^2$ in order to obtain a correlation
index of value for purposes of comparison.

To reduce $\sigma_{ay}^2$ we write,

$$\begin{aligned}
\sum\sum n_{xy}(y - \bar y_x)^2 &= \sum\sum n_{xy}(y - \bar y + \bar y - \bar y_x)^2 \\
&= \sum\sum n_{xy}(y - \bar y)^2 + \sum\sum n_{xy}(\bar y - \bar y_x)^2 + 2\sum\sum n_{xy}(y - \bar y)(\bar y - \bar y_x).
\end{aligned}$$

But

$$\begin{aligned}
2\sum\sum n_{xy}(y - \bar y)(\bar y - \bar y_x) &= 2\sum_x (\bar y - \bar y_x)\sum_y n_{xy}(y - \bar y) \\
&= 2\sum (\bar y - \bar y_x)\, n_x (\bar y_x - \bar y) \\
&= -2\sum n_x(\bar y_x - \bar y)^2 \\
&= -2N\eta^2\sigma_y^2.
\end{aligned}$$

Therefore $N\sigma_{ay}^2 = N\sigma_y^2 - N\eta^2\sigma_y^2$.

On division by $N\sigma_y^2$, we have,

$$\frac{\sigma_{ay}^2}{\sigma_y^2} = 1 - \eta^2.$$

It accordingly appears that this second measure of correlation
is not independent of the first, being equal to $1 - \eta^2$.
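The relation between the spread about the array means and the correlation ratio can be verified on any small correlation table. In the sketch below the table, stored as a dictionary from (x, y) to the subgroup frequency, and the function name are illustrative assumptions and are not data from the text.

```python
def array_spread(table):
    """Return (sigma_ay^2, sigma_y^2, eta^2) for a table {(x, y): frequency}."""
    total = sum(table.values())
    mean_y = sum(n * y for (_, y), n in table.items()) / total
    var_y = sum(n * (y - mean_y) ** 2 for (_, y), n in table.items()) / total

    n_x, sum_y = {}, {}
    for (x, y), n in table.items():
        n_x[x] = n_x.get(x, 0) + n
        sum_y[x] = sum_y.get(x, 0) + n * y
    mean_yx = {x: sum_y[x] / n_x[x] for x in n_x}      # means of the y arrays

    # Mean squared divergence of the variates from the means of their own arrays.
    var_within = sum(n * (y - mean_yx[x]) ** 2 for (x, y), n in table.items()) / total
    # Squared correlation ratio: variance of the array means over the total variance.
    eta_sq = sum(n_x[x] * (mean_yx[x] - mean_y) ** 2 for x in n_x) / total / var_y
    return var_within, var_y, eta_sq


table = {(1, 1): 2, (1, 2): 3, (2, 2): 4, (2, 3): 1, (3, 3): 5, (3, 4): 2}
var_within, var_y, eta_sq = array_spread(table)
print(var_within / var_y, 1 - eta_sq)
```

The two printed numbers should coincide, whatever frequencies are entered in the table.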

Exercises. 

1. Show that the mean spread of the variates about the regression
line of Chapter IX is equal to $(1 - r^2)\sigma_y^2$.

2. Show that $\sigma_{ay}^2$ is zero for perfect correlation and equal to $\sigma_y^2$
for zero correlation.

3. Discuss the convenience of the computation formula
$\eta^2 = 1 - \sigma_{ay}^2/\sigma_y^2$.

4. Which approach to the numerical measure of correlation seems 
clearer ? 

Derivation of the Equations of the Regression Lines.

Let the regression equations be of the form:

$$\bar y_x = b_{yx} \cdot x + a,$$

and

$$\bar x_y = b_{xy} \cdot y + c.$$

The moments of order zero and unity give for the regression
of y on x two moment equations:

$$\sum n_x \bar y_x = b_{yx}\sum n_x x + Na,$$

and

$$\sum n_x x\,\bar y_x = b_{yx}\sum n_x x^2 + a\sum n_x x.$$

As explained on page 75 the moments are computed for each
individual frequency of an array. Hence we have $\sum n_x \bar y_x$
and not merely $\sum \bar y_x$. For the same reason $a$ appears in the
moment sum once for each frequency; that is, N times.

Since $\bar y_x = \frac{1}{n_x}\sum_y n_{xy}\,y$,

$$\sum_x n_x \bar y_x = \sum_x n_x \cdot \frac{1}{n_x}\sum_y n_{xy}\,y = \sum_x\sum_y n_{xy}\,y,$$

a first moment about a horizontal line through the mean (both y and x
are assumed to be measured from their respective means) and hence zero
from an obvious extension of the theorem of page 35.

Likewise $\sum n_x x = 0$. Hence, from the first moment
equation, $a$ is equal to zero.

In the second equation we have,

$$\sum n_x x\,\bar y_x = \sum_x x \sum_y n_{xy}\,y = \sum\sum n_{xy}\,xy.$$

The summation $\sum n_x x^2$ has been taken equal to $N\sigma_x^2$ and
$\sum n_y y^2$ equal to $N\sigma_y^2$. It seems consistent with this notation
to assume $\sum\sum n_{xy}\,xy = Nr\sigma_x\sigma_y$ where $r$ is the numerical constant
of Chapter IX.

On reducing the second moment equation we have

$$Nr\sigma_x\sigma_y = b_{yx} \cdot N\sigma_x^2.$$

Therefore $b_{yx} = r\dfrac{\sigma_y}{\sigma_x}$,

and hence $\bar y_x = r\dfrac{\sigma_y}{\sigma_x}\,x$ is the required regression equation.
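The derivation can be checked by solving the two moment equations numerically, without first transferring to the means, and comparing the slope with $r\sigma_y/\sigma_x$. The sketch below is an illustration only; the function name and the small table are assumptions, and the table is stored as a dictionary from (x, y) to the subgroup frequency.

```python
import math


def regression_by_moments(table):
    """Solve the zero and first moment equations for the line y = b*x + a."""
    total = sum(table.values())
    sx = sum(n * x for (x, _), n in table.items())
    sy = sum(n * y for (_, y), n in table.items())
    sxx = sum(n * x * x for (x, _), n in table.items())
    syy = sum(n * y * y for (_, y), n in table.items())
    sxy = sum(n * x * y for (x, y), n in table.items())

    # Moment equations:  sy  = b*sx  + total*a
    #                    sxy = b*sxx + a*sx
    b = (total * sxy - sx * sy) / (total * sxx - sx * sx)
    a = (sy - b * sx) / total

    mean_x, mean_y = sx / total, sy / total
    sigma_x = math.sqrt(sxx / total - mean_x ** 2)
    sigma_y = math.sqrt(syy / total - mean_y ** 2)
    r = (sxy / total - mean_x * mean_y) / (sigma_x * sigma_y)
    return a, b, r, sigma_x, sigma_y, mean_x, mean_y


table = {(1, 1): 2, (1, 2): 3, (2, 2): 4, (2, 3): 1, (3, 3): 5, (3, 4): 2}
a, b, r, sigma_x, sigma_y, mean_x, mean_y = regression_by_moments(table)
print(b, r * sigma_y / sigma_x)      # the slope found two ways
print(a, mean_y - b * mean_x)        # the line passes through the two means
```

That the intercept equals $\bar y - b\bar x$ is the same statement as $a = 0$ when both variables are measured from their means.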

Exercises. 

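5. Derive the regression equation $\bar x_y = r\dfrac{\sigma_x}{\sigma_y}\,y$.

6. Prove in detail that $\sum\sum n_{xy}\,y = 0$, where x and y are measured
from the mean.

When x and y are measured from the original axes the
regression equations become

$$\bar y_x - \bar y = r\frac{\sigma_y}{\sigma_x}(x - \bar x),$$

$$\bar x_y - \bar x = r\frac{\sigma_x}{\sigma_y}(y - \bar y).$$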

The Relation Between $\eta$ and r. It was shown in Chapter
IX that $\eta$ and $r$ have the same numerical value when the regression
is truly linear. Hence a lack of agreement in the values
of $\eta$ and $r$ is an indication of a divergence from linearity in the
regression. The difference between $\eta$ and $r$ is expressible in terms
of the divergence from linearity by the two equations:

$$N\sigma_y^2(\eta_y^2 - r^2) = \sum n_x(Y_x - \bar y_x)^2$$

and

$$N\sigma_x^2(\eta_x^2 - r^2) = \sum n_y(X_y - \bar x_y)^2,$$

where $Y_x$ and $X_y$ are the regression line means.

To prove the first of these formulas let us add and subtract
$\bar y$ for each term in the summation $\sum n_x(Y_x - \bar y_x)^2$. We then have
after expansion,

$$\sum n_x(Y_x - \bar y)^2 + \sum n_x(\bar y_x - \bar y)^2 - 2\sum n_x(Y_x - \bar y)(\bar y_x - \bar y).$$

On substituting from the regression equations this expanded
form becomes

$$\sum n_x(x - \bar x)^2 \cdot r^2\frac{\sigma_y^2}{\sigma_x^2} + \sum n_x(\bar y_x - \bar y)^2 - 2\sum n_x \cdot r\frac{\sigma_y}{\sigma_x}(\bar y_x - \bar y)(x - \bar x),$$

which equals

$$N\sigma_x^2 \cdot r^2\frac{\sigma_y^2}{\sigma_x^2} + N\sigma_y^2\,\eta_y^2 - 2r\frac{\sigma_y}{\sigma_x}\,Nr\sigma_x\sigma_y.$$

Substituting for the last term and collecting we have

$$\sum n_x(Y_x - \bar y_x)^2 = N\sigma_y^2(\eta_y^2 - r^2).$$

Exercises. 

7. Prove the formula $\sum n_y(X_y - \bar x_y)^2 = N\sigma_x^2(\eta_x^2 - r^2)$.

8. Show from these formulas that $\eta > r$.

9. Show that the same pair of equations will be obtained for the
regression lines if the assumed lines are fitted to the individual
frequencies instead of to the means of the arrays.

10. Prove that for truly linear regression the standard deviation of
the means of the arrays is equal to $r\sigma_y$.

11. Show that for truly linear regression $\sigma_{ay}^2 = \sigma_y^2(1 - r^2)$.

The Coefficient r for Non-linear Regression. It has been
seen on page 85 that $r$ is always too small for any but strictly
linear regressions. This is due to the fact that the summation
$\sum\sum n_{xy}(x - \bar x)(y - \bar y)$ involves both positive and negative terms
which cancel each other with a consequent reduction in the value
of the summation. If the regression curve is carefully drawn
a fair idea of the trustworthiness of $r$ can be obtained by
observing the departures of that curve from linearity. A more
accurate way, of course, is to compute both $\eta$ and $r$ and observe
the difference in value of the two constants.

Since

$$r = \frac{\sum\sum n_{xy}(x - \bar x)(y - \bar y)}{N\sigma_x\sigma_y},$$

the size of $r$ varies directly as the value of the summation in the
numerator. In this summation the large values of both x and y
are along either diagonal and hence $r$ will be largest numerically
when the values of $n_{xy}$ are largest along a diagonal. If the
frequencies tend to lie along one diagonal the value of $r$ will be
positive; along the other, negative. If the distribution should
exhibit two tendencies, to concentrate along both diagonals,
the cancellation of terms with opposite signs would give rise to
a small value for $r$. Also the regression may be markedly
non-linear, circular, or periodic as a sine curve, so that the straight
line fitted to the means of the arrays is practically horizontal,
resulting in a very small value for $r$. This may be true even for data
which shows a definite tendency for the frequencies to cluster
closely along the curve of means; that is, it is possible for $r$ to
have a small value even though the data shows the attributes to
have in fact a high degree of correlation. In any of these or
similar cases the correlation should be determined from the
correlation ratio which is not affected by the form of the regression.
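An extreme case of non-linear regression makes the point concrete. In the sketch below, which is an illustration with made-up data and hypothetical function names, every variate lies exactly on the parabola $y = x^2$; the coefficient $r$ vanishes through the cancellation of positive and negative products, while the correlation ratio of y on x is unity.

```python
import math


def r_and_eta(table):
    """Return (r, eta of y on x) for a correlation table {(x, y): frequency}."""
    total = sum(table.values())
    mean_x = sum(n * x for (x, _), n in table.items()) / total
    mean_y = sum(n * y for (_, y), n in table.items()) / total
    var_x = sum(n * (x - mean_x) ** 2 for (x, _), n in table.items()) / total
    var_y = sum(n * (y - mean_y) ** 2 for (_, y), n in table.items()) / total
    cov = sum(n * (x - mean_x) * (y - mean_y) for (x, y), n in table.items()) / total

    n_x, sum_y = {}, {}
    for (x, y), n in table.items():
        n_x[x] = n_x.get(x, 0) + n
        sum_y[x] = sum_y.get(x, 0) + n * y
    # Variance of the means of the y arrays about the general mean of y.
    between = sum(n_x[x] * (sum_y[x] / n_x[x] - mean_y) ** 2 for x in n_x) / total

    return cov / math.sqrt(var_x * var_y), math.sqrt(between / var_y)


parabola = {(x, x * x): 1 for x in range(-3, 4)}
r, eta = r_and_eta(parabola)
print(r, eta)
```

Here $r$ alone would suggest no connection between the attributes, although y is completely determined by x.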

The Most Probable Value of a Characteristic can be
determined from $r$. Let us first define the properties,
homoscedasticity and homoclisy.

The mean standard deviation of the frequencies of an
array has been denoted by the symbol $\sigma_{ay}$ where $\sigma_{ay}^2 = \sigma_y^2(1 - \eta^2)$
or in terms of $r$, $\sigma_{ay}^2 = \sigma_y^2(1 - r^2)$. It must be remembered
that these are mean values so that it may well happen
that the true standard deviation of an individual array may
differ considerably from these values. A distribution in which
all arrays of a given sense, that is, all y or all x arrays have
the same standard deviation is said to be homoscedastic with
respect to the arrays of that sense.

It has been assumed that the frequencies of the arrays
are so distributed that the means and the modes coincide;
that is, so that the mean is the most probable value of the
array, but this may not always be even approximately true.
The arrays of a distribution are said to be homoclitic when
the mean is the most probable value of the array.

On the basis of the just preceding definitions it may be said
that for homoclitic arrays the most probable value of y
corresponding to a given value for x is found from the equation

$$y = r\frac{\sigma_y}{\sigma_x}\,x, \quad\text{or}\quad y - \bar y = r\frac{\sigma_y}{\sigma_x}(x - \bar x).$$

A knowledge of the most probable values is of little
importance unless accompanied by information of the dispersion
about that value; that is, of the standard deviation and the
probable deviation. Since the entire theory of estimating values
of a characteristic is based on the coefficient of correlation the
probable deviation of y when obtained from the regression curve
is logically $0.67449\,\sigma_y\sqrt{1 - r^2}$, and not $0.67449\,\sigma_y\sqrt{1 - \eta^2}$
(provided the arrays are homoscedastic, otherwise no general
formula is possible and the dispersion of each array must be
computed directly from the data of the respective arrays).
Likewise the probable error of x found from the regressions is
$0.67449\,\sigma_x\sqrt{1 - r^2}$, with the same restrictions as to
homoscedasticity.

If the three conditions of linearity of regression, of
homoscedasticity, and of homoclisy are satisfied the just preceding
theory of estimating the value of a variable characteristic is
complete and practically valuable. In ordinary distributions these
conditions are likely to hold at least approximately so that when
intelligently applied the theory is of importance. In every case
the regression curve should be determined graphically and both
$\eta$ and $r$ computed and the difference in their values noted, and
the test for linearity applied. If there is doubt as to the
homoscedasticity, the standard deviations can be computed
directly from the arrays in question and the probable deviations
determined from the resulting values instead of from the
preceding formula. The question of homoclisy is usually
disregarded though wide departures should be noted and taken into
consideration.

Exercises. 

12.* What is the most probable weight of a student of height 70 
inches ? 

13. What is the most probable height of a student of weight 132 
pounds ? 

14. What is the most probable rainfall of a month with a mean 
temperature of 54 degrees? 

15. What is the most probable top beef cattle price for a month 
with a top hog price of $8.25? 

16. Compute the probable deviations from the most probable values 
of Exercises 5, 6, 7, and 8. 

17. Discuss the practical reliability of the preceding estimates. In 
how far is the probable deviation a trustworthy index of this reliability? 

If two distributions are superimposed the value of r for the 
combined group is connected with the constituent r's by the following 
relation. 

N 

The proof of this equation is left as an exercise. 
Two superimposed distributions with the same means have the
simple relation of connection with the r's:

$$N\sigma_x\sigma_y\,r = N_1\sigma_{x_1}\sigma_{y_1}r_1 + N_2\sigma_{x_2}\sigma_{y_2}r_2,$$

from which the effect of various mixtures of data can be readily traced.
For instance, if the second distribution has a constant frequency for each
subgroup, $r_2$ is zero and the value of $r$ is smaller than that of $r_1$ in the
proportion $\dfrac{N_1\sigma_{x_1}\sigma_{y_1}}{N\sigma_x\sigma_y}$. That is, adding a constant to each frequency
decreases the value of $r$.

The effect of multiplying each frequency by a constant is readily
determined. Let $n_{xy}$ be replaced by $an_{xy}$, $n_x$ by $an_x$, and $n_y$ by $an_y$.
Then

$$r = \frac{\dfrac{\sum\sum n_{xy}\,xy}{N}}{\sqrt{\dfrac{\sum n_x x^2}{N} \cdot \dfrac{\sum n_y y^2}{N}}}$$

becomes

$$\frac{\dfrac{a\sum\sum n_{xy}\,xy}{aN}}{\sqrt{\dfrac{a\sum n_x x^2}{aN} \cdot \dfrac{a\sum n_y y^2}{aN}}} = \frac{\sum\sum n_{xy}\,xy}{\sqrt{\sum n_x x^2 \cdot \sum n_y y^2}},$$

as before.

That is, multiplying each frequency by a constant is without effect on
the value of $r$.

* Exercises 12, 13, 14, 15, refer to data already given.

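Correlation of indices. A mathematical measure of spurious
correlation will now be derived.

Let us take the two series of measurements x and y and let
$I_{xy} = x/y$ be the corresponding series of indices.

The mean $\bar I$ cannot, except under certain conditions, be obtained by
dividing the mean of the x's by the mean of the y's. For, by definition

$$\bar I_{xy} = \frac{1}{N}\sum\frac{x}{y}.$$

Transferring to the respective means, $\bar x$ and $\bar y$, as origins we have

$$\bar I_{xy} = \frac{1}{N}\sum\frac{\bar x + \delta x}{\bar y + \delta y},$$

where the $\delta$'s denote merely the new variables measured from the
respective means.

On rearranging

$$\begin{aligned}
\bar I_{xy} &= \frac{1}{N}\sum\frac{\bar x}{\bar y} \cdot \frac{1 + \dfrac{\delta x}{\bar x}}{1 + \dfrac{\delta y}{\bar y}} \\
&= \frac{1}{N} \cdot \frac{\bar x}{\bar y}\sum\left(1 + \frac{\delta x}{\bar x}\right)\left(1 + \frac{\delta y}{\bar y}\right)^{-1} \\
&= \frac{1}{N} \cdot \frac{\bar x}{\bar y}\sum\left(1 + \frac{\delta x}{\bar x} - \frac{\delta y}{\bar y} - \frac{\delta x\,\delta y}{\bar x\bar y} + \frac{\delta y^2}{\bar y^2}\right)
\end{aligned}$$

as far as second terms.

But $\sum\delta x = \sum\delta y = 0$; $\sum\delta x\,\delta y = Nr\sigma_x\sigma_y$; $\sum\delta y^2 = N\sigma_y^2$ and $\sum 1 = N$.

Hence,

$$\bar I_{xy} = \frac{\bar x}{\bar y}\left(1 + \frac{\sigma_y^2}{\bar y^2} - \frac{r\sigma_x\sigma_y}{\bar x\bar y}\right).$$

This formula shows that the error in assuming $\bar I_{xy} = \bar x/\bar y$ may be
considerable. It is smallest where the data is perfectly correlated and
largest for uncorrelated material.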
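The size of the error can be seen on a small numerical case. The sketch below uses made-up measurements of low variability, for which an expansion carried only as far as second terms is adequate; the function name and the data are assumptions for illustration. It compares the true mean of the indices with the uncorrected index of the means and with the corrected value.

```python
def index_means(xs, ys):
    """Return (mean of x/y, index of the means, corrected index of the means)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_y = sum((y - mean_y) ** 2 for y in ys) / n
    # The mean product of the deviations equals r * sigma_x * sigma_y.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    exact = sum(x / y for x, y in zip(xs, ys)) / n
    uncorrected = mean_x / mean_y
    corrected = uncorrected * (1 + var_y / mean_y ** 2 - cov / (mean_x * mean_y))
    return exact, uncorrected, corrected


xs = [98, 101, 103, 99, 102, 97]
ys = [50, 52, 49, 51, 48, 50]
exact, uncorrected, corrected = index_means(xs, ys)
print(exact, uncorrected, corrected)
```

For data of large variability the neglected terms of the expansion are no longer small, and the corrected value cannot be expected to agree so closely with the true mean.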

Exercises. 

18. Let indices be formed from the two series of measurements

1 2 3 4 5 6 7

1 3 4 3 2 7 4

Find the value of $\bar I_{xy}$ and compare with the index of the means, $\bar x/\bar y$.

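The Standard Deviation $\sigma_I$ of the index $I_{xy}$ is given by the
equation

$$\sigma_I^2 = \bar I_{xy}^2\left(\frac{\sigma_x^2}{\bar x^2} + \frac{\sigma_y^2}{\bar y^2} - \frac{2r\sigma_x\sigma_y}{\bar x\bar y}\right).$$

The standard deviation will be computed about the corrected mean.
We have

$$N\sigma_I^2 = \sum\left\{\frac{\bar x + \delta x}{\bar y + \delta y} - \frac{\bar x}{\bar y}\left(1 + \frac{\sigma_y^2}{\bar y^2} - \frac{r\sigma_x\sigma_y}{\bar x\bar y}\right)\right\}^2.$$

On dropping the cross term,

$$\begin{aligned}
N\sigma_I^2 &= \bar I_{xy}^2\sum\left\{\frac{\delta x}{\bar x} - \frac{\delta y}{\bar y} + \text{squared terms}\right\}^2 \\
&= \bar I_{xy}^2\,N\left\{\frac{\sigma_x^2}{\bar x^2} + \frac{\sigma_y^2}{\bar y^2} - \frac{2r\sigma_x\sigma_y}{\bar x\bar y}\right\}.
\end{aligned}$$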



Exercises. 

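19. Prove that exactly the same formula is obtained for the standard
deviation about the unadjusted mean $\bar x/\bar y$.

The theorem just stated as an exercise shows that in so far as the
standard deviation of a distribution of index numbers is concerned, it is
immaterial whether the index of the means, $\bar x/\bar y$, is corrected or not.

By a method very similar to that employed in the case of the
standard deviation, it may be shown that the coefficient of correlation
between the indices $x/y$ and $z/w$ is given by the formula

$$r_{I_1I_2} = \frac{r_{xz}\dfrac{\sigma_x\sigma_z}{\bar x\bar z} - r_{xw}\dfrac{\sigma_x\sigma_w}{\bar x\bar w} - r_{yz}\dfrac{\sigma_y\sigma_z}{\bar y\bar z} + r_{yw}\dfrac{\sigma_y\sigma_w}{\bar y\bar w}}{\sqrt{\dfrac{\sigma_x^2}{\bar x^2} + \dfrac{\sigma_y^2}{\bar y^2} - 2r_{xy}\dfrac{\sigma_x\sigma_y}{\bar x\bar y}}\;\sqrt{\dfrac{\sigma_z^2}{\bar z^2} + \dfrac{\sigma_w^2}{\bar w^2} - 2r_{zw}\dfrac{\sigma_z\sigma_w}{\bar z\bar w}}}.$$

When the four variables x, y, z and w are uncorrelated the value of
$r_{I_1I_2}$ is zero. When the bases are constants $\sigma_y = \sigma_w = 0$, and we have

$$r_{I_1I_2} = \frac{r_{xz}\dfrac{\sigma_x\sigma_z}{\bar x\bar z}}{\dfrac{\sigma_x\sigma_z}{\bar x\bar z}} = r_{xz}.$$

That is, dividing each value of a characteristic by the same constant
does not affect the value of the coefficient of correlation.

As a special case of this result, if the absolute values of two
characteristics are correlated the degree of correlation is not changed by
expressing the measurements as percents.

When the bases y and w are identical $r_{I_1I_2}$ takes the value,

$$r_{I_1I_2} = \frac{r_{xz}\dfrac{\sigma_x\sigma_z}{\bar x\bar z} - r_{xy}\dfrac{\sigma_x\sigma_y}{\bar x\bar y} - r_{yz}\dfrac{\sigma_y\sigma_z}{\bar y\bar z} + \dfrac{\sigma_y^2}{\bar y^2}}{\sqrt{\dfrac{\sigma_x^2}{\bar x^2} + \dfrac{\sigma_y^2}{\bar y^2} - 2r_{xy}\dfrac{\sigma_x\sigma_y}{\bar x\bar y}}\;\sqrt{\dfrac{\sigma_z^2}{\bar z^2} + \dfrac{\sigma_y^2}{\bar y^2} - 2r_{yz}\dfrac{\sigma_y\sigma_z}{\bar y\bar z}}}.$$

Now let x, y and z be entirely uncorrelated. Then

$$r_{I_1I_2} = \frac{\dfrac{\sigma_y^2}{\bar y^2}}{\sqrt{\dfrac{\sigma_x^2}{\bar x^2} + \dfrac{\sigma_y^2}{\bar y^2}}\;\sqrt{\dfrac{\sigma_z^2}{\bar z^2} + \dfrac{\sigma_y^2}{\bar y^2}}}.$$

This last value of $r_{I_1I_2}$ may even equal 0.5, in the case when

$$\frac{\sigma_y}{\bar y} = \frac{\sigma_x}{\bar x} = \frac{\sigma_z}{\bar z}.$$

Hence by dividing each deviation by a third variable, it is
possible to introduce correlation into strictly uncorrelated
material to as great an extent as 0.5.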
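The introduction of correlation by a common base can be observed in an artificial experiment. The following sketch is a simulation under stated assumptions, not a computation from the text: three independent series are drawn from normal distributions with the same mean and standard deviation (so that their relative variabilities are equal), the seed and sample size are arbitrary, and the function names are hypothetical. The correlation of the absolute values and that of the indices formed on the common base are then computed.

```python
import math
import random


def pearson_r(a, b):
    """Coefficient of correlation between two equally long series."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    return cov / math.sqrt(var_a * var_b)


def simulate_spurious(n=20000, mean=100.0, sd=5.0, seed=0):
    """Return (r of x with z, r of x/y with z/y) for independent x, y, z."""
    rng = random.Random(seed)
    xs = [rng.gauss(mean, sd) for _ in range(n)]
    ys = [rng.gauss(mean, sd) for _ in range(n)]
    zs = [rng.gauss(mean, sd) for _ in range(n)]
    index_1 = [x / y for x, y in zip(xs, ys)]
    index_2 = [z / y for z, y in zip(zs, ys)]
    return pearson_r(xs, zs), pearson_r(index_1, index_2)


raw_r, index_r = simulate_spurious()
print(raw_r, index_r)
```

With equal relative variabilities the correlation of the indices should fall near one half, while that of the absolute values remains near zero, apart from fluctuations of sampling.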

Care must therefore be taken in dealing with index numbers
that the full value of $r$ is significant for the absolute
values of the measurements. By computing $r$ from the formula

$$r = \frac{\dfrac{\sigma_y^2}{\bar y^2}}{\sqrt{\dfrac{\sigma_x^2}{\bar x^2} + \dfrac{\sigma_y^2}{\bar y^2}}\;\sqrt{\dfrac{\sigma_z^2}{\bar z^2} + \dfrac{\sigma_y^2}{\bar y^2}}}$$

the value of the greatest possible degree of spurious correlation is
obtained. A value of $r$ greater than this value is certainly
significant; a value less may be significant but must be accepted
with caution.

Since by the formula the spurious correlation is zero when
the standard deviation or variability of y is zero, it follows that
the base of a system of index numbers should be as nearly
constant as possible.

A theory of spurious correlation might be developed for the
correlation ratio but the algebraic details are so much more
workable for the correlation coefficient that it would be hardly
worth the extra effort. It is conceivable that such a theory
would be practically necessary but it is unlikely because after
all only approximate results are valuable. There would be little
of value in attempting to measure the degree of spurious
correlation with precision.

It must be remembered that the matter of spurious correlation is
essentially one of interpretation. The question is what does correlation
mean. The correlation is actual and real for the indices but it may be
spurious in so far as the absolute values of the measurements are
concerned.



CHAPTER XIII. 

THE METHOD OF CONTINGENCY.

THE CORRELATION OF NON-QUANTITATIVE

CHARACTERISTICS.

The Mean Squared Contingency. Both the correlation
ratio and the correlation coefficient are based on the variation
in the means of the arrays; the first directly and the second
through the straight line fitted to the points located by the
means. Another method of measuring the degree of connection
between the attributes, described and illustrated in this chapter,
is based on elementary notions of probability. It may be
stated in beginning the discussion of the method of contingency
that not the least important value of a study of the subject is the
additional insight which it gives into the fundamental nature of
correlation.

In Table I the distribution of height without reference to 
weight is given by the total frequency of the y arrays; that is, 
by the totals at the foot of the table, and the distribution of 
weight, without reference to the distribution of height by the 
column of sums at the right in the table. 

iWhw there is no tendency for certain weights to be most 
often associated with certain heights, the frequency of a sub- 
group should be proportional to the total frequencies of its 
two arrays. Thus imagine the frequencies of the subgroups 
erased from Table VIII of Chapter VII and then filled in en- 
tirely at random; that is, without bias or selection. Since 
110/750 of the total frequency of the distribution appears in 
the height array of weight type 137; that is, since no in- 
dividuals out of 750 are of weight 137, it is logical to assume 
that this height array contains 1 14/750 of the frequency of each 
array which it crosses The frequency of the subgroup (68,- 
137), for instance, should be 110/750 of 126. And in general, 
when the individi4als are placed at random, the frequency of a 

«^ . My 

subgroup is given by the formida . For the y array of 

(119) 
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type X contains — of the frequencies of each array which it 

N 

crosses. The frequency of the x array of type y is n^. Hence 

the subgroup of intersection has a frequency, — . ttj, whieh 

N 

Hx . fly 

equals . 

Now if in the actual distribution the frequency of a subgroup,
$n_{xy}$, is larger or smaller than the random selection frequency
given by the formula $\frac{n_x n_y}{N}$, the divergence must be due
to the presence in the data of a tendency for certain values of
the attributes to be most often associated and hence the total
extent of this divergence is a measure of the degree of the association
or correlation in the data. This method of measuring
correlation is called the method of contingency.

The difference $n_{xy} - \frac{n_x n_y}{N}$ is squared to prevent the cancelling
of positive and negative values. Since only the relative
size of the differences is significant, this square is divided by the
above random selection frequency $\frac{n_x n_y}{N}$. On summing all such
values, we have the mean square contingency $\phi^2$, where

$$N\phi^2 = \sum\sum \frac{\left(n_{xy} - \frac{n_x n_y}{N}\right)^2}{\frac{n_x n_y}{N}}.$$

On expanding and reducing, this summation is arranged
in a more convenient form for computation. We have

$$\frac{\left(n_{xy} - \frac{n_x n_y}{N}\right)^2}{\frac{n_x n_y}{N}} = N\frac{n_{xy}^2}{n_x n_y} - 2n_{xy} + \frac{n_x n_y}{N},$$

and hence

$$N\phi^2 = N\sum\sum \frac{n_{xy}^2}{n_x n_y} - 2N + N,$$

since $\sum\sum n_{xy} = N$ and
$\sum\sum \frac{n_x n_y}{N} = \frac{1}{N}\sum n_x \sum n_y = \frac{1}{N}\sum n_x \cdot N = \sum n_x = N$.

Therefore

$$\phi^2 = \sum\sum \frac{n_{xy}^2}{n_x n_y} - 1.$$

The probable deviation of $\phi$ is discussed at length by Pearson
and Blakeman*.
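The two forms of the mean square contingency can be compared numerically. The following is a minimal sketch in Python; the function name and the small 2 by 2 table are hypothetical illustrations, not data from this book. It evaluates $\phi^2$ once from the squared divergences and once from the shortened formula, using exact fractions so that the two results can be compared without rounding.

```python
from fractions import Fraction

def phi_squared(table):
    """Return phi^2 computed two ways for a table given as rows of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    by_definition = Fraction(0)   # accumulates (observed - expected)^2 / expected
    short_form = Fraction(-1)     # accumulates sum n_xy^2 / (n_x n_y), minus 1
    for i, row in enumerate(table):
        for j, n_xy in enumerate(row):
            expected = Fraction(row_totals[i] * col_totals[j], n)
            by_definition += (n_xy - expected) ** 2 / expected
            short_form += Fraction(n_xy ** 2, row_totals[i] * col_totals[j])
    return by_definition / n, short_form

# Hypothetical frequencies, chosen only for illustration.
definition_value, short_value = phi_squared([[20, 10], [5, 25]])
print(definition_value, short_value)
```

Both values agree, as the reduction above requires. The sketch assumes that no array has a total of zero.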

Exercises. 

1. Compute the value of $\phi$ for the data of Table VIII.

2. Why not divide the square of the difference for each sub-group 
by the actual frequency of the sub-group instead of the frequency under 
the assumption of no correlation? 

Properties of $\phi$. In data selected entirely at random; that
is, where $n_{xy} = \frac{n_x n_y}{N}$ for all values of x and y, the value of $\phi$
is of course zero. It does not necessarily follow, however, that
for absolutely uncorrelated material; that is, for data having
$\eta_x = \eta_y = 0$, the value of $\phi$ must be zero.

A moment's consideration will show that the greatest value for
$\sum \frac{n_{xy}^2}{n_x n_y}$, taken over the subgroups of any one array, is unity and
that this greatest value cannot be attained unless the subgroup of
intersection is the only subgroup with non-vanishing frequency in
either of the two arrays intersecting in that subgroup. It follows
that, if the distribution is not square, the number of arrays giving
the maximum value cannot be greater than the number of the
longer arrays. Hence in symbols, if r and s are the numbers of
arrays of the respective attributes and r = s or r < s, the greatest
value for $\phi^2$ is r - 1.



*Biometrika, Vol. V, p. 191 et seq.

For illustration, in the table

a b c d
e f g h
i j k l

at least one horizontal array must have more than one non-vanishing
subgroup frequency. Let this table be

a 0 0 0
0 f 0 0
0 0 k l.

Then

$$\sum\sum \frac{n_{xy}^2}{n_x n_y} = \frac{a^2}{a \cdot a} + \frac{f^2}{f \cdot f} + \frac{k^2}{k(k+l)} + \frac{l^2}{l(k+l)} = 1 + 1 + \frac{k}{k+l} + \frac{l}{k+l} = 3.$$
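The illustration can be repeated with numbers. The sketch below is in Python; the frequencies assigned to a, f, k and l are hypothetical. It sums $n_{xy}^2/(n_x n_y)$ over a 3 by 4 table in which each horizontal array but one holds a single non-vanishing subgroup, and so exhibits a table whose $\phi^2$ is r - 1.

```python
from fractions import Fraction

def sum_ratio(table):
    """Sum of n_xy^2 / (n_x n_y) over all subgroups; phi^2 is this sum minus 1."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    return sum(Fraction(v * v, rows[i] * cols[j])
               for i, r in enumerate(table)
               for j, v in enumerate(r) if v)

a, f, k, l = 7, 4, 3, 9          # hypothetical frequencies
table = [[a, 0, 0, 0],
         [0, f, 0, 0],
         [0, 0, k, l]]
total = sum_ratio(table)
print(total, total - 1)          # the sum, and phi^2
```

Whatever positive values are chosen for a, f, k and l, the sum is 3 and $\phi^2$ is 2, which is r - 1 for the three horizontal arrays.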

Exercises. 

3. Show that for $\phi = 0$, the means of the x and the y arrays lie
on vertical and horizontal lines respectively.

4. Show by actual substitution in the formula $\phi^2 = \sum\sum \frac{n_{xy}^2}{n_x n_y} - 1$ that
$\phi = 0$ when $n_{xy} = \frac{n_x n_y}{N}$.

5. Verify the just preceding theory by assigning different combinations
of values to the symbols a, b, c, d, e, f, g, h, i, in the distribution:

a b c 
d e f 
g h i 

6. Do the same for the distribution 

a b c d e 
f g h i j.

7. Show that the greatest value of $\phi^2$ for a table of the form of
Table VIII is 13.

8. Give an algebraic demonstration for this theory when applied to
a general distribution r by s fold.

The greatest disadvantage of $\phi$ as a measure of correlation
arises from the fact that its value depends on the number
of arrays in the distribution so that it is almost entirely useless
for purposes of comparison. Another disadvantage lies in the
fact that, notwithstanding the logical simplicity and directness
of the theory underlying the method of contingency, in practice
the interpretation of variations in the value of $\phi$ is a matter
of much difficulty. For instance, when $\phi$ equals 2.5 what is
the significance of an increase of 0.5 in its value? How much
greater is the degree of closeness of association in the latter
case than in the first? A third objection is that for a large
table the labor of computation is heavy.

The first objection above is partially overcome by making use
of the coefficient of contingency, $\sqrt{\frac{\phi^2}{1 + \phi^2}}$. This constant is
given added prestige by the following relation. It may be shown
that for a finely divided distribution of a particular type the coefficient
of contingency and the coefficient of correlation are equal
in value. Consequently in certain forms of distribution this furnishes
a convenient method of obtaining the value of r. However,
care must be taken to make sure that the assumptions
essential for the validity of this theorem are approximated to
with sufficient closeness. Ordinarily it is better to make use of
methods which do not rest on so extensive assumptions.
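A short computation shows how the coefficient of contingency behaves at the two extremes. The sketch below is in Python and assumes Pearson's form of the coefficient, $\sqrt{\phi^2/(1+\phi^2)}$; the function name and both tables are hypothetical.

```python
import math

def contingency_coefficient(table):
    """Coefficient of contingency sqrt(phi^2 / (1 + phi^2)) for rows of counts."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    phi2 = sum(v * v / (rows[i] * cols[j])
               for i, r in enumerate(table)
               for j, v in enumerate(r)) - 1.0
    phi2 = max(phi2, 0.0)   # guard against a tiny negative rounding error
    return math.sqrt(phi2 / (1.0 + phi2))

independent = [[10, 20], [30, 60]]            # subgroups proportional to totals
diagonal = [[5, 0, 0], [0, 5, 0], [0, 0, 5]]  # perfect association, 3 by 3
c_independent = contingency_coefficient(independent)
c_diagonal = contingency_coefficient(diagonal)
print(c_independent, c_diagonal)
```

For the perfectly associated 3 by 3 table the coefficient is $\sqrt{2/3}$ and not unity, so the dependence on the number of arrays is reduced but not removed.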

An approximation to the probable deviation of the co- 
efficient of contingency is to take one and one-third the prob- 
able deviation of r. 

Exercises. 

9. Compute the coefficient of contingency for the data of Table VIII 
and compare with the value of r already computed. 

10. Do the same for the data of Table IX, Chapter VII. 

11. By combining arrays in the distribution of Table VII and computing
the successive values of $\sqrt{\frac{\phi^2}{1 + \phi^2}}$ show the effect of different widths
of classes on the value of this constant.

12. Show that the coefficient of contingency is smaller than the 
value of r computed by the method of moments. 

13. Compare the reliability of the coefficient of contingency for
highly and for slightly correlated data.

14. Compare the labor required to compute the value of $\phi$ with that
for $\eta$.

In concluding this part of the discussion of the method of 
contingency it may be stated that when the attributes can be 
definitely measured there is no practical advantage in computing 
the value of $\phi$.

Non-Quantitative Characteristics. Because the formula
$\sum\sum \frac{n_{xy}^2}{n_x n_y}$ does not contain the deviations x and y and contains only
the frequencies of the subgroups it can be applied to distributions
in which it is impossible or undesirable to assign numerical
values to the deviations; for instance, a distribution of hair and
eye color, of degrees of intelligence in drawing and music. Such
distributions are said to involve characteristics not quantitatively
measured or measurable.

Thus in its fundamental theory the coefficient of contingency 
applies with equal validity to quantitative and to non-quantita- 
tive data. Moreover, since the number of classes in the case 
of non-quantitative distributions is ordinarily small the labor 
of computation is not unduly heavy, and hence the coefficient 
of contingency is of greater practical importance for this kind 
of data than for quantitative data. However it will now be 
shown that for non-quantitative distributions, the correlation 
ratio is a more convenient and satisfactory measure of correla- 
tion than is the coefficient of contingency. 

A correlation problem very similar to that arising from 
non-quantitative data is the finding of the degree of correla- 
tion when the measurements of the attributes in quantitative 
data are classified into very broad classes; to find, for instance, 
the extent of the tendency for under-height and over-weight to
be associated. Further than the effect that so broad classes may
have in producing errors in the results obtained by the formula
for $\eta$ there is no theoretical objection to the direct application
of the theory of the correlation ratio to a distribution obtained
by grouping into broad classes.

However, the theory of the correlation ratio does not apply
directly to strictly qualitative data and for that reason
we shall justify its use for such distributions by showing that
in a very important form of distribution, the two by two table, $\eta$
and $\phi$ are identical and that in other ordinarily occurring cases
the values of the two constants are highly correlated.

Exercises. 

16. Arrange the data of Table VIII in the following form and compute
the values of $\eta$ and $\phi$.

                        Height.
Weight.        Under 68     Over 68     Totals
Under 137
Over 137
Totals

The Four-fold and the Nine-fold tables. We shall now
derive the formulas for $\eta$ and $\phi$ for a 2 x 2 table and obtain
the computation formulas for $\eta$ for a 3 by 3 table. The same
method might be employed to derive special formulas for each
type of table. In the absence of special formulas the general
formula for $\eta$ can be applied directly.

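Let us take the four-fold table,

$n_{11}$   $n_{12}$
$n_{21}$   $n_{22}$

in which $n_{1\cdot}$, $n_{2\cdot}$ denote the totals of the horizontal arrays and
$n_{\cdot 1}$, $n_{\cdot 2}$ the totals of the vertical arrays.

We have

$$\bar{y} = \frac{n_{\cdot 1} + 2n_{\cdot 2}}{N} = \frac{n_{\cdot 1}}{N} + \frac{2n_{\cdot 2}}{N} = 1 + \frac{n_{\cdot 2}}{N}.$$

Similarly, $\bar{y}_1 = 1 + \frac{n_{12}}{n_{1\cdot}}$, and $\bar{y}_2 = 1 + \frac{n_{22}}{n_{2\cdot}}$.

Substituting these values in the formula,

$$N\sigma_{my}^2 = n_{1\cdot}(\bar{y}_1 - \bar{y})^2 + n_{2\cdot}(\bar{y}_2 - \bar{y})^2,$$

where $\sigma_{my}^2$ is the mean squared deviation of the means of the arrays,
we have after some detailed reduction,

$$N\sigma_{my}^2 = \frac{(n_{11}n_{22} - n_{21}n_{12})^2}{N\,n_{1\cdot}n_{2\cdot}}.$$

Also $N\sigma_y^2 = \frac{n_{\cdot 1}n_{\cdot 2}}{N}$.

Therefore

$$\eta = \frac{n_{11}n_{22} - n_{21}n_{12}}{\sqrt{n_{1\cdot}n_{2\cdot}n_{\cdot 1}n_{\cdot 2}}}.$$

From the formula $\phi^2 = \sum\sum \frac{n_{xy}^2}{n_x n_y} - 1$ it is readily shown
by direct computation that

$$\phi = \frac{n_{11}n_{22} - n_{21}n_{12}}{\sqrt{n_{1\cdot}n_{2\cdot}n_{\cdot 1}n_{\cdot 2}}}.$$

The equality of $\phi$ and $\eta$ for the fourfold distribution is
therefore demonstrated.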
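The equality can be checked on any four-fold table. The following sketch is in Python; the function names and the frequencies are hypothetical. It computes $\eta$ from the means of the arrays with the classes scored 1 and 2, computes $\phi$ from the subgroup frequencies, and compares both with the closed form (taken without regard to sign).

```python
import math

def eta_y_on_x(table):
    """Correlation ratio of y on x; the vertical arrays are scored 1, 2, ..."""
    n = sum(map(sum, table))
    scores = range(1, len(table[0]) + 1)
    col_totals = [sum(c) for c in zip(*table)]
    mean_y = sum(s * c for s, c in zip(scores, col_totals)) / n
    var_y = sum(c * (s - mean_y) ** 2 for s, c in zip(scores, col_totals)) / n
    between = 0.0
    for row in table:
        n_x = sum(row)
        array_mean = sum(s * v for s, v in zip(scores, row)) / n_x
        between += n_x * (array_mean - mean_y) ** 2
    return math.sqrt(between / n / var_y)

def phi(table):
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    total = sum(v * v / (rows[i] * cols[j])
                for i, r in enumerate(table) for j, v in enumerate(r))
    return math.sqrt(max(total - 1.0, 0.0))

table = [[20, 10], [5, 25]]                  # hypothetical frequencies
(n11, n12), (n21, n22) = table
margins = (n11 + n12) * (n21 + n22) * (n11 + n21) * (n12 + n22)
closed_form = abs(n11 * n22 - n21 * n12) / math.sqrt(margins)
print(eta_y_on_x(table), phi(table), closed_form)
```

The three printed values coincide to rounding error.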

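For the nine-fold table,

$n_{11}$   $n_{12}$   $n_{13}$
$n_{21}$   $n_{22}$   $n_{23}$
$n_{31}$   $n_{32}$   $n_{33}$

we have by a reduction similar to that for the 2 by 2 table,

$$\bar{y} = 1 + \frac{n_{\cdot 2} + 2n_{\cdot 3}}{N}, \quad \bar{y}_1 = 1 + \frac{n_{12} + 2n_{13}}{n_{1\cdot}}, \quad \bar{y}_2 = 1 + \frac{n_{22} + 2n_{23}}{n_{2\cdot}}, \quad \bar{y}_3 = 1 + \frac{n_{32} + 2n_{33}}{n_{3\cdot}}.$$

Let $\frac{n_{\cdot 2} + 2n_{\cdot 3}}{N} = l$,

then, on substituting for $\bar{y}_x$ and $\bar{y}$,

$$n_x(\bar{y}_x - \bar{y})^2 = n_x\left(\frac{n_{x2} + 2n_{x3}}{n_x} - l\right)^2 = \frac{(n_{x2} + 2n_{x3})^2}{n_x} - 2l(n_{x2} + 2n_{x3}) + n_x l^2.$$

Therefore,

$$\sum n_x(\bar{y}_x - \bar{y})^2 = \sum \frac{(n_{x2} + 2n_{x3})^2}{n_x} - 2lN\sum \frac{n_{x2} + 2n_{x3}}{N} + Nl^2 = \sum \frac{(n_{x2} + 2n_{x3})^2}{n_x} - Nl^2.$$

Similarly $\sum n_y(y - \bar{y})^2 = N(l - l^2) + 2n_{\cdot 3}$.

Hence

$$\eta^2 = \frac{\sum \frac{(n_{x2} + 2n_{x3})^2}{n_x} - Nl^2}{N(l - l^2) + 2n_{\cdot 3}}.$$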

Exercises. 

17. Compute the value of $\eta$ from the following distribution of the
variation in receipts and prices from month to month of live hogs at
Union Stock Yards, Chicago, from 1901 to 1914.

                          Receipts.
Prices.            -25      -25....25       25
-50                 10          20          37
-50....50            7          32           5
50                  24          26           6

The original formula for $\phi$,

$$\phi^2 = \sum\sum \frac{n_{xy}^2}{n_x n_y} - 1,$$

is probably in the case of the 3 x 3 table as convenient as any
for the computation of that constant.

It is immediately evident that in general $\eta$ and $\phi$ cannot
be equivalent for a table larger than four-fold, because there
are two $\eta$'s for each distribution and only one $\phi$. The following
theorems may be stated.

1. When $\phi = 0$, the value of $\eta$ for y on x and for x on
y are both zero; that is, $\eta_y = \eta_x = 0$.

2. When $\eta_y = \eta_x = 0$, it may ordinarily be expected that
$\phi$ will be practically zero but it is not absolutely necessary that
such be the case.

3. When only one $\eta$ is zero, it is most likely that $\phi$ will be
small in value.

4. When $\phi$ takes the maximum value, $\eta_y = \eta_x = 1$.

5. When $\eta_y = \eta_x = 1$, $\phi$ takes the maximum value.

6. When one $\eta$ only is unity it is most likely that the value
of $\phi$ will not differ greatly from the maximum.

7. There is a close correspondence between the values of
$\phi$ and the $\eta$'s for data of all degrees of correlation.

Discussion of the Theorems. On substituting the relations
$n_{xy} = \frac{n_x n_y}{N}$ in the formula for $\eta_y$ and for $\eta_x$ it follows immediately
that $\bar{y} = \bar{y}_1 = \bar{y}_2 = \bar{y}_3$ and hence that $\eta_x = \eta_y = 0$ for
$\phi = 0$.

In regard to theorem 2, it will now be shown that the nine
relations, $n_{xy} = \frac{n_x n_y}{N}$, which result for a nine-fold table when
$\phi = 0$, can be reduced to four independent relations, which result
when $\phi = 0$. That is, if there are four such relations the
other five must hold true and the value of $\phi$ is necessarily zero.
In other words, the vanishing of $\phi$ imposes four and only four
restrictions or conditions on the data of a 3 by 3 table.

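For if $n_{11} = \frac{n_{1\cdot}n_{\cdot 1}}{N}$, $n_{21} = \frac{n_{2\cdot}n_{\cdot 1}}{N}$, $n_{12} = \frac{n_{1\cdot}n_{\cdot 2}}{N}$, and
$n_{22} = \frac{n_{2\cdot}n_{\cdot 2}}{N}$, it follows that $n_{31} = \frac{n_{3\cdot}n_{\cdot 1}}{N}$. Let us substitute the
equivalents for $n_{11}$ and $n_{21}$ in the equation $n_{31} = n_{\cdot 1} - n_{11} - n_{21}$.

This substitution gives

$$n_{31} = n_{\cdot 1} - \frac{n_{1\cdot}n_{\cdot 1}}{N} - \frac{n_{2\cdot}n_{\cdot 1}}{N} = n_{\cdot 1} - \frac{n_{\cdot 1}}{N}(N - n_{3\cdot}) = n_{\cdot 1} - n_{\cdot 1} + \frac{n_{3\cdot}n_{\cdot 1}}{N} = \frac{n_{3\cdot}n_{\cdot 1}}{N},$$

and similarly for the remaining relations.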

The vanishing of $\eta_y$ implies the three relations,

$$\frac{n_{12} + 2n_{13}}{n_{1\cdot}} = \frac{n_{22} + 2n_{23}}{n_{2\cdot}} = \frac{n_{32} + 2n_{33}}{n_{3\cdot}} = \frac{n_{\cdot 2} + 2n_{\cdot 3}}{N}.$$

It can be readily shown that only two of the three relations are
independent. That is, if the first two relations hold, the third
is necessarily true.

If $\eta_x$ as well as $\eta_y$ vanish the three additional relations

$$\frac{n_{21} + 2n_{31}}{n_{\cdot 1}} = \frac{n_{22} + 2n_{32}}{n_{\cdot 2}} = \frac{n_{23} + 2n_{33}}{n_{\cdot 3}} = \frac{n_{2\cdot} + 2n_{3\cdot}}{N}$$

are implied. Here again only two of the three additional relations are
independent and of the six relations implied by the vanishing
of both $\eta_y$ and $\eta_x$ it is only a matter of algebraic detail to show
that only three are independent. That is, the vanishing of both
$\eta_x$ and $\eta_y$ imposes one less condition on the data than does the
vanishing of $\phi$. And hence it is not necessarily true that
$\phi = 0$ when $\eta_x = \eta_y = 0$.

As to the maximum values for these constants, the relations
$\eta_y = \eta_x = 1$ require that there be but one non-vanishing
frequency in each array of either sense and hence the condition
for a maximum value for $\phi$ is satisfied. The converse relations
are evidently true.
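Both statements, that $\phi$ need not vanish when the two $\eta$'s vanish and that a table with one non-vanishing frequency in each array gives the maximum of all three constants, can be illustrated numerically. The sketch below is in Python; the function names and both nine-fold tables are hypothetical, and the classes are scored 1, 2, 3.

```python
from fractions import Fraction

def eta_squared(table):
    """Squared correlation ratio of the column score (1, 2, 3) on the rows."""
    n = sum(map(sum, table))
    cols = [sum(c) for c in zip(*table)]
    mean = Fraction(sum((j + 1) * c for j, c in enumerate(cols)), n)
    total = sum(c * (j + 1 - mean) ** 2 for j, c in enumerate(cols))
    between = sum(
        sum(row) * (Fraction(sum((j + 1) * v for j, v in enumerate(row)), sum(row)) - mean) ** 2
        for row in table)
    return between / total

def phi_squared(table):
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    return sum(Fraction(v * v, rows[i] * cols[j])
               for i, r in enumerate(table) for j, v in enumerate(r)) - 1

def transpose(table):
    return [list(c) for c in zip(*table)]

flat = [[5, 2, 5], [2, 8, 2], [5, 2, 5]]        # hypothetical frequencies
diagonal = [[4, 0, 0], [0, 6, 0], [0, 0, 5]]    # hypothetical frequencies
for name, t in (("flat", flat), ("diagonal", diagonal)):
    # eta^2 for y on x, eta^2 for x on y, and phi^2
    print(name, eta_squared(t), eta_squared(transpose(t)), phi_squared(t))
```

In the first table the means of all arrays coincide, so both $\eta$'s are zero, yet $\phi^2$ is 1/4. In the second both $\eta$'s are unity and $\phi^2$ takes its greatest value, 2.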

For only one $\eta$ equal to unity, however, the data might be
arranged, for instance, in the form of the table,

a 0 0

0 0 0

0 b c,

when $\phi^2$ would not have the maximum value.

If a large number of distributions were made up from the
same population and the values of $\eta$ and of $\phi$ computed for each
distribution, it would be found that in the long run a large value
of $\eta$ was associated with the larger values of $\phi$ and vice versa.
But to obtain a formula for the correlation of $\eta$ and $\phi$ is a matter
of considerable algebraic detail and the resulting formula is
so complicated that it is practically worthless*. For this reason
the algebraic discussion of Theorems 2, 3 and 6 is not given in
the complete form.

We have outlined the method of showing that the value of
$\eta$ for a non-quantitative distribution has a close connection to the
value of $\phi$; that is, that $\eta$ and $\phi$ are highly correlated for such
data, and hence the correlation ratio may be used to measure the
degree of correlation or association in the data with all the assurance
that attaches to the method of contingency. It is of distinct
practical advantage to have one coefficient or index of correlation
for all kinds of data and for that reason the coefficient
of contingency is not greatly used in practice.

*Compare Blakeman, "The Probable Error of the Coefficient of
Contingency," loc. cit.

Caution is necessary at one point, however, for data divided 
into only a few classes does not convey the same amount of in- 
formation regarding the correlation between the characteristics as 
does the more detailed material and hence not the same degree 
of confidence can be placed in the computed value of any constant
derived from the less detailed table. For this reason compari- 
sons of the values of correlation measures between different forms 
of distributions must be carefully made and due account taken of 
the fact that for the small table the results do not warrant the 
same degree of confidence as do the results from the finely 
divided table. 



Exercises. 



18. Prove that if

$n_{11} = \frac{n_{1\cdot}n_{\cdot 1}}{N}$, $n_{21} = \frac{n_{2\cdot}n_{\cdot 1}}{N}$, $n_{12} = \frac{n_{1\cdot}n_{\cdot 2}}{N}$, and $n_{22} = \frac{n_{2\cdot}n_{\cdot 2}}{N}$,

the remaining five relations of the same type hold.

19. Show that if $\eta_y = 0$ the vanishing of $\eta_x$ imposes only one additional
condition on the data.

20. Show that if $\eta_y = \eta_x = 0$, the frequencies of the nine-fold table
can be expressed in terms of the marginal sums and frequency of any
one sub-group.

21. Show that in the distribution

2 4 2

1 1 1

2 4 2

$\eta_y = \eta_x = 0$ and $\phi \neq 0$.

22. Construct a fictitious table having $\eta_y = 1$ and $\phi$ not having a
maximum value.

23. Investigate the relations between $\eta$ and $\phi$ for a 2 x 3 table.


APPENDIX I. 

Introduction. The generalized frequency curves of Pear- 
son are so diverse in shape that a curve of this class can be 
found to fit any ordinary statistical distribution. By the 
following methods the fitting of a Pearson curve is reduced 
almost entirely to a matter of routine substitution in formulas, 
so that the practical statistician can make extended use of the 
curves without great familiarity with their theory. 

This discussion is designed both to present the working methods of 
the generalized frequency curves and to give the statistician who has a 
minimum of acquaintance with the higher mathematics some degree of 
familiarity with the underlying theory. The demonstrations are, for the 
most part, omitted. Many of the exercises have to do with the omitted 
theorems and derivations. 

In developing the theory of the generalized frequency curves 
it is logical, as well as practically convenient, to start with the 
normal curve and consider the general distribution as a mod- 
ification* of the normal type of distribution. 

The Slope Property. The particular modification which 
leads to the frequency curves of Pearson is obtained by generalizing
the slope condition of the normal curve.** The slope of a 
curve at a given point is the tangent of the angle which the 
line touching the curve at that point makes with the X-axis. 
In the case of the normal curve, the ratio of the slope to the 
ordinate is negatively equal to the abscissa of the point. 

This slope property is generalized by taking the ratio
equal, not to $-x$, but to $-\frac{x + a}{b + cx + dx^2}$, where a, b, c, d are
constants. The slope of a curve is ordinarily denoted by the
symbol $\frac{dy}{dx}$.



*Compare Edgeworth, Jour. Roy. Stat. Soc. Also West, "On the
Translated Normal Curve," Ohio Journal of Science, Dec., 1915.

** First extensively treated by Pearson in the article "Skew Variation
in Homogeneous Material," Phil. Trans.

In this notation the generalized slope property is expressed
by the equation,

$$\frac{1}{y}\frac{dy}{dx} = -\frac{x + a}{b + cx + dx^2}.$$

The Constants, a, b, c, d. The statistical significance of 
each of the constants, a, b, c, d, can be readily determined. 

In Chapter IV, it is shown that the slope of a frequency
curve is zero at a mode. Since $\frac{dy}{dx}$; that is, the slope, is zero
when $x = -a$, the constant a determines the position of the
mode. The mode is therefore at a distance, $-a$, from the mean.
As explained in Chapter V, a is thus a measure of the skewness,
of the lack of symmetry of the distribution. For a symmetrical
distribution a is evidently 0.

When both c and d are zero the generalized slope equation
is merely the normal slope equation with x replaced by $\frac{x + a}{b}$.
This leads to the normal curve,

$$y = k\,e^{-\frac{(x + a)^2}{2b}},$$

where k is a constant.

Comparing this equation with the standard normal equation,

$$y = k\,e^{-\frac{x^2}{2\sigma^2}},$$

we see that b equals $2\sigma^2$ multiplied by a constant.

The degree of symmetry of the curve is indicated by the
value of c as well as by the value of a. For, when x is positive,
the term cx is added in the denominator and when x is negative
it is subtracted. This tends to make the frequency curve steeper
to the left than to the right of the origin, and hence the curve
must extend farther to the right, that is, the curve must be skew.*

But it was seen in Chapter V that $\beta_1$ is the fundamental
measure of skewness. Therefore both a and c must contain $\beta_1$
as a factor.

When $x^2$ is small the constant d has little effect on the
slope, but for the extremities of the curve where x and hence
$dx^2$ is large the slope is reduced by a large value of d. It will
be seen that d depends largely on $\beta_2$.

* See page 57, Chapter V.

The Types of Curves. We may now discuss the distinct
types of curves that possess the slope properties of the generalized
slope equation. Distinct types of curves result according as
the denominator, $b + cx + dx^2$, has two distinct factors, two coincident
factors, or has no factors.

With two distinct factors the slope equation can be written

$$\frac{1}{y}\frac{dy}{dx} = -\frac{x + a}{b + cx + dx^2} = k\,\frac{x + a}{(r_1 + x)(r_2 - x)},$$

where k is a constant.

By the usual mathematical methods we then have

$$y = y'\,(r_1 + x)^{\frac{k(a - r_1)}{r_1 + r_2}}\,(r_2 - x)^{-\frac{k(a + r_2)}{r_1 + r_2}}, \qquad \text{(A)}$$

where y' is the constant of integration.

By a simple transformation and rearrangement, this equation
can be reduced to the form of Pearson's first type, namely:

$$y = y_0\left(1 + \frac{x}{a_1}\right)^{m_1}\left(1 - \frac{x}{a_2}\right)^{m_2}. \qquad \text{Type I.}$$

Exercises. 

1. Carry through in detail the necessary transformations to de- 
termine the equation of Type I from equation (A). 

2. Perform the integrations to obtain the curve of Type I. 

When $a_1$ and $a_2$ are equal it is readily shown that $m_1 = m_2$
and the equation takes the form of Type II:

$$y = y_0\left(1 - \frac{x^2}{a^2}\right)^{m}. \qquad \text{Type II.}$$

When one root of the denominator $b + cx + dx^2$ is indefinitely
large, that is, when d is zero, we have, from the theory
of the exponential e, the third type:

$$y = y_0\,e^{-\gamma x}\left(1 + \frac{x}{a}\right)^{\gamma a}. \qquad \text{Type III.}$$

This equation may be looked upon as that of Type I with $a_2$
indefinitely large.

The curves of Type III are especially serviceable because 
the equations are simple in form and convenient for computa- 
tion. They are the most elementary skew curves. 

By transforming expression (A), in a manner somewhat
different from that to obtain Type I, the form of Pearson's
sixth type is readily obtained. It is

$$y = y_0\,(x - a)^{m_2}\,x^{-m_1}. \qquad \text{Type VI.}$$



Exercises. 

3. Obtain the equation of Type II by direct integration from the 
differential equation. 

4. Compare Type II with the normal curve. 

5. Obtain Type III directly by integration. 

6. Obtain Type III from (A). 

7. Compare the shape of Type III with that of the normal curve. 

8. Obtain the equation of Type VI directly from the differential 
equation. 

10. Is Type VI geometrically distinct from Type I? 

When two roots are indefinitely large we have the normal
curve:

$$y = y_0\,e^{-\frac{x^2}{2\sigma^2}},$$

which is called simply "Normal" in Pearson's scheme of classification.

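With two coincident roots, the slope equation becomes

$$\frac{1}{y}\frac{dy}{dx} = k\,\frac{x + a}{(x + r)^2},$$

where k is again a constant.

This leads to the form

$$y = y_0\,x^{-p}\,e^{-\frac{\gamma}{x}}, \qquad \text{Type V.}$$

which is Pearson's Type V.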

Exercises. 

11. Derive in detail the equation of Type V.

When the denominator of the slope equation cannot be
factored the integration is performed by writing

$$\frac{1}{y}\frac{dy}{dx} = -\frac{x + a}{b + cx + dx^2} = -\frac{1}{d} \cdot \frac{x + \frac{c}{2d} - \frac{c}{2d} + a}{x^2 + \frac{c}{d}x + \frac{c^2}{4d^2} + \frac{b}{d} - \frac{c^2}{4d^2}}.$$

This gives

$$y = y_0\left(1 + \frac{x^2}{a^2}\right)^{-m} e^{-\nu\tan^{-1}\frac{x}{a}}, \qquad \text{Type IV.}$$

which is the form of Type IV.

Exercises. 

12. Derive in detail the equation of Type IV. 

13. Derive the equation of Type IV by transformation from the 
equation of Type I. 

14. Compare the form of the equation of Type IV to that of Type III. 

If $\nu$ is zero in the immediately preceding equation we have
Pearson's Type VII.

$$y = y_0\left(1 + \frac{x^2}{a^2}\right)^{-m}. \qquad \text{Type VII.}$$

The Intercepts. The intercepts made on the X-axis by
the various types of curves can now be examined. The following
theorem is fundamental in the theory of the intercepts of
Pearson's curves: an incommensurable power of a negative
number does not exist.

Let $-N$ denote any negative number and $(-N)^p = N^p(\cos p\pi + \sqrt{-1}\,\sin p\pi)$,
where $\sqrt{-1}$ is the square root of negative unity. Unless
p is an integer $\sin p\pi$ is not zero and hence $(-N)^p$ contains $\sqrt{-1}$, which
has no arithmetical value. Hence powers of $-N$ which are not integral
do not exist.
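The theorem can be seen at work in a small computation. The sketch below is in Python; the function name and the choice N = 8 are hypothetical. It evaluates $N^p(\cos p\pi + \sqrt{-1}\,\sin p\pi)$ for integral and non-integral p and prints the real and imaginary parts.

```python
import math

def power_of_negative(n, p):
    """Principal value of (-n)**p for n > 0, as n**p (cos p*pi + i sin p*pi)."""
    return (n ** p) * complex(math.cos(p * math.pi), math.sin(p * math.pi))

for p in (2, 3, 0.5, 1.7):
    value = power_of_negative(8.0, p)
    print(p, value.real, value.imag)
```

For p = 2 and p = 3 the imaginary part is zero apart from rounding; for p = 0.5 and p = 1.7 it is not, so no real ordinate exists beyond the intercept when the exponent is not integral.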

In Type I the intercepts are $-a_1$ and $a_2$. Since the exponents
$m_1$ and $m_2$ are not in general integers, the curve stops at the
X-axis and there are no points below that axis. Indeed, there are
no negative ordinates on any of the curves.

In Type II the intercepts are of the same length and numerically
equal to a.

In Type III one intercept is — a and the other is indefinitely 
large. 

In the case of the normal curve both intercepts are indefi- 
nitely large. 

In Types IV and VII there are no intercepts. 

In Type V one intercept passes through the origin and the 
other is indefinitely large. 

In Type VI both intercepts are positive or both are negative. 

Ordinarily the type of curve selected should have intercepts
harmonizing with the natural limits of the range of the data.
For instance, data necessarily limited in either direction should
be smoothed with a curve correspondingly limited. However,
nearly all the curves are practically limited in range because the
ordinates soon become negligible, so that the matter is not one
of great importance; though a somewhat better fit is likely to be
obtained with a curve limited in accordance with the data.

Exercises. 

15. Of what types is the normal curve a limiting curve? 

16. Distinguish between a curve with indefinitely large intercepts 
and a curve with imaginary or non-existent intercepts. 

17. Show that there are indefinitely more curves of Types I, VI and 
IV than of Types III, V, II or VII, or of the normal curve. 

18. Show how Type I can be said algebraically to include Type IV. 

19. Show that Types I and VI are not fundamentally distinct. 

20. Show that by taking all combinations of sign into account there 
are three distinct classes of curve under Type I. 

21. Show that there are two sub-classes under Type II according 
as the exponent m is positive or negative. 

22. Show that there are two classes under Type III. 

23. Is there more than one general form of curve under Type IV? 
Under type V? 

24. Discuss the curves of Type VI as to the existence of sub- 
classes within the Type. 

25. What types of these curves have asymptotes? 

26. Do all the curves have a mode? 

27. Find the points of inflexion for each type. 

The Criterion K. Since the separation into types depends
primarily on the nature of the roots of the quadratic,
$b + cx + dx^2$, the discriminant of this quadratic constitutes a
criterion of the type of curve which fits the distribution. The
values of a, b, c, and d are first determined by the method of
moments and then the discriminant expressed in terms of the
computed expressions for b, c, and d.

The formula for K, the discriminant obtained in this way is

$$K = \frac{\beta_1(\beta_2 + 3)^2}{4(2\beta_2 - 3\beta_1 - 6)(4\beta_2 - 3\beta_1)}.$$

This formula for K is derived as follows:

The differential equation $\frac{1}{y}\frac{dy}{dx} = -\frac{x + a}{b + cx + dx^2}$ may
be written $(b + cx + dx^2)\frac{dy}{dx} = -y(x + a)$. Multiplying each side by
$x^n$, we have

$$\int x^n(b + cx + dx^2)\,dy = -\int y(x + a)x^n\,dx.$$

On integrating the left side by parts

$$\left[x^n(b + cx + dx^2)y\right] - nb\int x^{n-1}y\,dx - (n+1)c\int x^n y\,dx - (n+2)d\int x^{n+1}y\,dx = -\int y\,x^{n+1}\,dx - a\int y\,x^n\,dx.$$

With the usual notation, where $\mu'_n = \int x^n y\,dx$,

$$\left[x^n(b + cx + dx^2)y\right] - nb\mu'_{n-1} - (n+1)c\mu'_n - (n+2)d\mu'_{n+1} = -\mu'_{n+1} - a\mu'_n.$$

If y is very small at the ends of the range the first expression vanishes
and the moment equation connects the three moments $\mu'_{n-1}$, $\mu'_n$,
and $\mu'_{n+1}$.

On rearranging this equation we have

$$a\mu'_n - nb\mu'_{n-1} - (n+1)c\mu'_n - (n+2)d\mu'_{n+1} = -\mu'_{n+1}.$$

Since the moment $\mu'_0 = 1$ and, if the mean is taken as origin,
$\mu'_1 = 0$, we have for n = 0, 1, 2, 3, respectively the four equations:

$$a - c = 0$$
$$-b - 3d\mu_2 = -\mu_2$$
$$a\mu_2 - 3c\mu_2 - 4d\mu_3 = -\mu_3$$
$$a\mu_3 - 3b\mu_2 - 4c\mu_3 - 5d\mu_4 = -\mu_4$$

On solving this set of equations and substituting in the differential
or slope equation, we have

$$\frac{1}{y}\frac{dy}{dx} = -\frac{x + \frac{\mu_3(\mu_4 + 3\mu_2^2)}{10\mu_2\mu_4 - 18\mu_2^3 - 12\mu_3^2}}{\frac{\mu_2(4\mu_2\mu_4 - 3\mu_3^2)}{10\mu_2\mu_4 - 18\mu_2^3 - 12\mu_3^2} + \frac{\mu_3(\mu_4 + 3\mu_2^2)}{10\mu_2\mu_4 - 18\mu_2^3 - 12\mu_3^2}\,x + \frac{2\mu_2\mu_4 - 3\mu_3^2 - 6\mu_2^3}{10\mu_2\mu_4 - 18\mu_2^3 - 12\mu_3^2}\,x^2}.$$

In terms of $\beta_1$ and $\beta_2$ this becomes

$$\frac{1}{y}\frac{dy}{dx} = -\frac{x + \frac{\sqrt{\mu_2}\sqrt{\beta_1}(\beta_2 + 3)}{2(5\beta_2 - 6\beta_1 - 9)}}{\frac{\mu_2(4\beta_2 - 3\beta_1) + \sqrt{\mu_2}\sqrt{\beta_1}(\beta_2 + 3)\,x + (2\beta_2 - 3\beta_1 - 6)\,x^2}{2(5\beta_2 - 6\beta_1 - 9)}}.$$

The discriminant of the quadratic denominator is the required
criterion, K. It is easily shown that

$$K = \frac{\beta_1(\beta_2 + 3)^2}{4(2\beta_2 - 3\beta_1 - 6)(4\beta_2 - 3\beta_1)}.$$

For, the quadratic expression, $dx^2 + cx + b$, may be written

$$d\left[x + \frac{c + \sqrt{c^2 - 4bd}}{2d}\right]\left[x + \frac{c - \sqrt{c^2 - 4bd}}{2d}\right].$$

Hence the character of
the two factors depends on the value of the quantity $(c^2 - 4bd)$. When
this is zero the two factors are equal; when it is negative there are no
factors, etc. Writing $(c^2 - 4bd)$ in the form $(c^2/4bd - 1)\,4bd$ we have,
if $K = \frac{c^2}{4bd}$, the following classes of factors according to values of K:

If K < 0; that is, if K is negative, the factors are unequal, because
a negative sign for $c^2/4bd$ must be due to unlike signs for b and d and
hence the product 4bd must be negative. That is, $(c^2 - 4bd)$ is positive for
K < 0.

If K is positive there are two cases, according as K is greater or less
than unity. If K lies between zero and positive unity, $(c^2 - 4bd)$ is negative
and consequently there are no factors. If K > 1, $(c^2 - 4bd)$ is
again positive and the factors are unequal, etc.
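The solution of the moment equations used in the derivation above can be verified by substitution. The sketch below is in Python; the function names and the moments $\mu_2 = 2$, $\mu_3 = 1$, $\mu_4 = 15$ are hypothetical. It computes a, b, c, d from the closed expressions and confirms, in exact fractions, that each of the four moment equations (n = 0, 1, 2, 3) is satisfied, and that $c^2/4bd$ agrees with the formula for K in terms of $\beta_1$ and $\beta_2$.

```python
from fractions import Fraction as F

def slope_constants(mu2, mu3, mu4):
    """Closed-form a, b, c, d of the slope equation, mean taken as origin."""
    denom = 10 * mu2 * mu4 - 18 * mu2 ** 3 - 12 * mu3 ** 2
    a = c = mu3 * (mu4 + 3 * mu2 ** 2) / denom
    b = mu2 * (4 * mu2 * mu4 - 3 * mu3 ** 2) / denom
    d = (2 * mu2 * mu4 - 3 * mu3 ** 2 - 6 * mu2 ** 3) / denom
    return a, b, c, d

# Hypothetical moments about the mean: mu0 = 1, mu1 = 0.
mu = {0: F(1), 1: F(0), 2: F(2), 3: F(1), 4: F(15)}
a, b, c, d = slope_constants(mu[2], mu[3], mu[4])

def residual(n):
    """Left side minus right side of the n-th moment equation."""
    lower = mu[n - 1] if n > 0 else F(0)
    return (a * mu[n] - n * b * lower - (n + 1) * c * mu[n]
            - (n + 2) * d * mu[n + 1] + mu[n + 1])

beta1 = mu[3] ** 2 / mu[2] ** 3
beta2 = mu[4] / mu[2] ** 2
k_from_constants = c * c / (4 * b * d)
k_from_betas = beta1 * (beta2 + 3) ** 2 / (
    4 * (2 * beta2 - 3 * beta1 - 6) * (4 * beta2 - 3 * beta1))
print([residual(n) for n in range(4)], k_from_constants, k_from_betas)
```

All four residuals are zero and the two values of K are equal.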

The Value of K and the Types of Curve. The following
table gives the types of curves corresponding to the different
values of K.

K < 0, i.e. negative                            Type I.
K = 0:   $\beta_1 = 0$, $\beta_2 = 3$           Normal Curve.
         $\beta_1 = 0$, $\beta_2 < 3$           Type II.
         $\beta_1 = 0$, $\beta_2 > 3$           Type VII.
0 < K < 1                                       Type IV.
K = 1                                           Type V.
K > 1, but not indefinitely large               Type VI.
K > 1 and indefinitely large                    Type III.

It is to be noted that the types of curve for any given statistical
distribution can now be determined by strictly arithmetic
methods.
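The arithmetic of the criterion may be sketched as follows in Python. The function names are hypothetical, and the assignment of types is an assumption: it follows Pearson's usual scheme, in which K = 0 with $\beta_2 > 3$ is Type VII and a finite K greater than unity is Type VI. Exact comparisons are used for the transition types, which in practice would be replaced by a judgment of how near the computed K lies to 0, 1 or an indefinitely large value.

```python
def criterion_k(beta1, beta2):
    """K = beta1 (beta2 + 3)^2 / (4 (2 beta2 - 3 beta1 - 6)(4 beta2 - 3 beta1))."""
    if beta1 == 0:
        return 0.0               # symmetrical curves: c = 0, hence K = 0
    denominator = 4 * (2 * beta2 - 3 * beta1 - 6) * (4 * beta2 - 3 * beta1)
    if denominator == 0:
        return float("inf")      # d = 0: K indefinitely large
    return beta1 * (beta2 + 3) ** 2 / denominator

def pearson_type(beta1, beta2):
    """Type of curve under the assumed assignment described in the lead-in."""
    k = criterion_k(beta1, beta2)
    if k == float("inf"):
        return "Type III"
    if k < 0:
        return "Type I"
    if k == 0:
        if beta2 == 3:
            return "Normal"
        return "Type II" if beta2 < 3 else "Type VII"
    if k < 1:
        return "Type IV"
    if k == 1:
        return "Type V"
    return "Type VI"

# Hypothetical pairs (beta1, beta2).
for pair in ((0, 3), (0, 2.5), (0, 4), (0.5, 3), (0.5, 5), (1, 4.6), (1, 4.5)):
    print(pair, pearson_type(*pair))
```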

The only restriction on the generality of the theory of the criterion
K is that the quantity $x^n(b + cx + dx^2)y$ must vanish at both ends of the
range. This condition marks the pairs of values of $\beta_1$ and $\beta_2$ for which
no curve of the generalized differential equation can be found. The
limiting values of ft and ft are ft > }ft and ft > ft/8 + 9/2 (see 
Exercises 29 and 30 below). 




Exercises. 

28. Read the explanation to Tables XXXV-XLVI in "Tables for
Statisticians and Biometricians."

29.* Derive the formulas

$$\beta_n\ (n\ \text{even}) = (n + 1)\left\{\tfrac{1}{2}\beta_{n-1} + (1 + \tfrac{1}{2}\alpha)\beta_{n-2}\right\} \Big/ \left\{1 - \tfrac{1}{2}(n - 1)\alpha\right\}$$
$$\beta_n\ (n\ \text{odd}) = (n + 1)\left\{\tfrac{1}{2}\beta_1\beta_{n-1} + (1 + \tfrac{1}{2}\alpha)\beta_{n-2}\right\} \Big/ \left\{1 - \tfrac{1}{2}(n - 1)\alpha\right\}$$

where $\alpha = (2\beta_2 - 3\beta_1 - 6)/(\beta_2 + 3)$.

30. From the computation formulas for Type II, prove that
m is negative when $\beta_2 < 1.8$.

31. Prove from the working formulas of Type I that Type I includes
three sub-classes according to the signs of $m_1$ and $m_2$. Derive the
criterion curve,

ft(8ft — 9ft — 12) (4ft — 3ft) = (10/3, — 12ft — 18)* (/82 + 3)«* 

32. Prove that ft >ft 

33. Prove the relation ft > 15/8/9 + 9/2. 

34. Show that a large value of ft for the curves derived from the 
generalized differential equation denotes a comparatively flat-topped 
curve. 

35. Show that for the normal curve with a = o, we have & = ^, 

The Computation Formulas. The computation formulas 
for the several types of Pearson's frequency curves are derived 
in accordance with the method of moments. For each type as 
many moment equations are written as there are constants in the 
equation of a curve of the type. In some of the type equations, 
as in Type I where $a_1/m_1 = a_2/m_2$, the constants are connected
by equations so that the number of moment equations is reduced. 
The moment equations are the result of equating the theoretical 
moments of the curve obtained by integration to the moments 
computed directly from the data. 

It might be expected that the differential equation in terms of $\mu_2$
and the $\beta$'s would be integrated to give the equations directly, but
the present process is more convenient. The chief purpose, therefore, of
the slope or differential equation is for the determination of the type
forms of the equations. After the algebraic forms of the equations are
determined each type is worked out without making use of its connection
either with the slope equation or with other type forms.



* See page lxiii of "Tables for Statisticians and Biometricians."
The expression $\Gamma(p)$, called the gamma function, occurs in
the following formulas. This function is defined by the relation

$$\Gamma(p) = (p - 1)\,\Gamma(p - 1).$$

If $p$ is an integer, $\Gamma(p) = (p - 1)!$.
If $p$ is not an integer, $\Gamma(p) = (p - 1)(p - 2) \cdots P\,\Gamma(P)$,
where $P$ is the remainder after subtracting a sufficient number of
1's to bring $p$ down to between 2 and 1 in value. The values of
$\Gamma(P)$ are given in Table XXXI of "Tables."

The probable errors of $\kappa$ as well as of $\beta_1$ and $\beta_2$ are
given in "Tables."
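
The reduction of the gamma function to an argument between 1 and 2 may be sketched in Python as follows. The function name `gamma_by_reduction` is hypothetical, and the library function `math.gamma` is used here merely as a stand-in for the lookup in Table XXXI.

```python
import math


def gamma_by_reduction(p, table=math.gamma):
    """Gamma(p) for p > 1, using Gamma(p) = (p - 1) * Gamma(p - 1).

    `table` stands in for the tabled values of Gamma(P), 1 < P <= 2.
    """
    if p <= 1.0:
        raise ValueError("this sketch handles p > 1 only")
    product = 1.0
    while p > 2.0:
        p -= 1.0        # subtract a 1 ...
        product *= p    # ... and keep the factor (p - 1) of the relation
    return product * table(p)


print(gamma_by_reduction(5))     # an integer argument
print(gamma_by_reduction(4.5))   # a fractional argument
```

For an integer argument the loop stops at $P = 2$, where $\Gamma(2) = 1$, so the product is the factorial of $p - 1$.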

The derivation of the following computation formulas, ex- 
cept the moment formulas, is not possible without an extensive 
acquaintance with the calculus.* 

After the constants in the equation are computed the
smoothed frequencies are obtained by computing the areas under
the curve and between the bounding ordinates. Thus the frequency
of the first class is the area between the ordinates $x = \tfrac{1}{2}$
and $x = 1\tfrac{1}{2}$. Simpson's quadrature formula is ordinarily used
for finding the class areas. According to this formula the area is

$$\tfrac{1}{6}\left\{y_{x-\frac{1}{2}} + 4y_x + y_{x+\frac{1}{2}}\right\},$$

where $y_{x-\frac{1}{2}}$ and $y_{x+\frac{1}{2}}$ are the bounding
ordinates and $y_x$ is the mid-ordinate of the class.
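
A minimal sketch of the class areas by Simpson's formula is given below, in Python. The normal curve and its constants (N = 1000, mean 10, standard deviation 3) are hypothetical and serve only to supply ordinates. The classes are taken of unit width with mid-points at the integers.

```python
import math


def class_area(curve, x):
    """Simpson's formula for the class of unit width whose mid-point is x."""
    return (curve(x - 0.5) + 4.0 * curve(x) + curve(x + 0.5)) / 6.0


def normal_curve(x, n=1000.0, mean=10.0, sigma=3.0):
    """A normal curve with hypothetical constants, used only as an example."""
    z = (x - mean) / sigma
    return n / (sigma * math.sqrt(2.0 * math.pi)) * math.exp(-0.5 * z * z)


# Smoothed frequencies for the classes centred at -10, -9, ..., 30.
smoothed = [class_area(normal_curve, x) for x in range(-10, 31)]
print(round(sum(smoothed), 3))
```

The sum of the smoothed frequencies should reproduce the total frequency closely, which gives a ready check on the computed constants.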

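Formulas for the Moments.

$$\nu_2 = 2S_3 - d(1 + d),$$

$$\nu_3 = 6S_4 - 3\nu_2(1 + d) - d(1 + d)(2 + d),$$

$$\nu_4 = 24S_5 - 2\nu_3\{2(1 + d) + 1\} - \nu_2\{6(1 + d)(2 + d) - 1\} - d(1 + d)(2 + d)(3 + d),$$

$$\sigma = \sqrt{\mu_2},$$

$$\beta_1 = \mu_3^2 \div \mu_2^3,$$

$$\beta_2 = \mu_4 \div \mu_2^2,$$

$$\kappa = \frac{\beta_1(\beta_2 + 3)^2}{4(4\beta_2 - 3\beta_1)(2\beta_2 - 3\beta_1 - 6)}.$$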



* See Elderton, "Frequency Curves and Correlation," C. & E. Layton,
for a thoro discussion of the derivations.
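
The moment formulas may be tried on a small distribution. The following Python sketch rests on assumptions not stated on this page: the classes are numbered x = 1, 2, 3, ..., the sums are taken as $S_k = \frac{1}{N}\sum \binom{x+k-2}{k-1} f_x$ with $d = S_2$, and the moments are used without any adjustment for grouping. The function names are hypothetical.

```python
from math import comb, sqrt


def summation_moments(freqs):
    """Return d, nu2, nu3, nu4 for classes x = 1, 2, ... with these frequencies.

    S_k is ASSUMED to be the sum of C(x + k - 2, k - 1) * f_x, divided by N.
    """
    total = sum(freqs)

    def s(k):
        return sum(comb(x + k - 2, k - 1) * f
                   for x, f in enumerate(freqs, start=1)) / total

    d = s(2)  # the mean, measured from x = 0
    nu2 = 2 * s(3) - d * (1 + d)
    nu3 = 6 * s(4) - 3 * nu2 * (1 + d) - d * (1 + d) * (2 + d)
    nu4 = (24 * s(5) - 2 * nu3 * (2 * (1 + d) + 1)
           - nu2 * (6 * (1 + d) * (2 + d) - 1)
           - d * (1 + d) * (2 + d) * (3 + d))
    return d, nu2, nu3, nu4


def betas_and_criterion(mu2, mu3, mu4):
    """sigma, beta1, beta2 and the criterion kappa from the moments."""
    beta1 = mu3 ** 2 / mu2 ** 3
    beta2 = mu4 / mu2 ** 2
    denominator = 4 * (4 * beta2 - 3 * beta1) * (2 * beta2 - 3 * beta1 - 6)
    kappa = beta1 * (beta2 + 3) ** 2 / denominator
    return sqrt(mu2), beta1, beta2, kappa


d, nu2, nu3, nu4 = summation_moments([1, 4, 6, 4, 1])
sigma, beta1, beta2, kappa = betas_and_criterion(nu2, nu3, nu4)
print(d, nu2, nu3, nu4)
print(sigma, beta1, beta2, kappa)
```

The frequencies 1, 4, 6, 4, 1 are symmetrical, so the third moment and the criterion vanish, and $\beta_2$ falls below 3.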
The computation formulas for Type I are as follows:
The equation is,

$$y = y_0\left(1 + \frac{x}{a_1}\right)^{m_1}\left(1 - \frac{x}{a_2}\right)^{m_2},$$

where $a_1/m_1 = a_2/m_2$.



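We have

$$r = \frac{6(\beta_2 - \beta_1 - 1)}{3\beta_1 - 2\beta_2 + 6},$$

$$b^2 = \frac{\mu_2}{4}\left\{16(r + 1) + \beta_1(r + 2)^2\right\}.$$

$m_2$ and $m_1$ are given by the formulas

$$m = \frac{1}{2}\left\{r - 2 \pm r(r + 2)\sqrt{\frac{\beta_1}{16(r + 1) + \beta_1(r + 2)^2}}\right\}.$$

The constant $m_1$ is taken with the negative root when $\mu_3$ is
positive and with the positive root when $\mu_3$ is negative.

$a_1$ and $a_2$ can be found from the relations $a_1 + a_2 = b$ and
$a_1/m_1 = a_2/m_2$.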



$$y_0 = \frac{N}{b}\cdot\frac{m_1^{m_1}\,m_2^{m_2}}{(m_1 + m_2)^{m_1 + m_2}}\cdot\frac{\Gamma(m_1 + m_2 + 2)}{\Gamma(m_1 + 1)\,\Gamma(m_2 + 1)}.$$

The skewness is $\dfrac{1}{2}\sqrt{\beta_1}\,\dfrac{r + 2}{r - 2}$.

$$\text{Mode} = \text{mean} - \frac{1}{2}\,\frac{\mu_3}{\mu_2}\,\frac{r + 2}{r - 2}.$$
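
The working of Type I may be sketched in Python under stated assumptions. The function name is hypothetical. The formulas used are those usually given for this type, namely $r = 6(\beta_2 - \beta_1 - 1)/(3\beta_1 - 2\beta_2 + 6)$, $b = \frac{1}{2}\sqrt{\mu_2\{16(r+1) + \beta_1(r+2)^2\}}$, and the two roots $\frac{1}{2}\{r - 2 \pm r(r+2)\sqrt{\beta_1/(16(r+1) + \beta_1(r+2)^2)}\}$ for $m_1$ and $m_2$. The sketch supposes that both exponents come out positive.

```python
import math


def type_one_constants(n, mu2, mu3, beta1, beta2):
    """Constants of a Type I curve, assuming m1 and m2 both come out positive."""
    r = 6.0 * (beta2 - beta1 - 1.0) / (3.0 * beta1 - 2.0 * beta2 + 6.0)
    d = 16.0 * (r + 1.0) + beta1 * (r + 2.0) ** 2
    b = 0.5 * math.sqrt(mu2 * d)  # b = a1 + a2
    root = r * (r + 2.0) * math.sqrt(beta1 / d)
    # m1 takes the negative root when mu3 is positive, and conversely.
    sign = -1.0 if mu3 > 0.0 else 1.0
    m1 = 0.5 * (r - 2.0 + sign * root)
    m2 = 0.5 * (r - 2.0 - sign * root)
    # a1 + a2 = b together with a1/m1 = a2/m2.
    a1 = b * m1 / (m1 + m2)
    a2 = b * m2 / (m1 + m2)
    # y0 is computed through logarithms to avoid very large gamma values.
    log_y0 = (math.log(n / b)
              + m1 * math.log(m1) + m2 * math.log(m2)
              - (m1 + m2) * math.log(m1 + m2)
              + math.lgamma(m1 + m2 + 2.0)
              - math.lgamma(m1 + 1.0) - math.lgamma(m2 + 1.0))
    return {"r": r, "b": b, "m1": m1, "m2": m2,
            "a1": a1, "a2": a2, "y0": math.exp(log_y0)}
```

A convenient check is to supply the moments of a curve whose exponents are known in advance and to see that the same exponents are recovered.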






The formulas for Type II are as follows. The equation
for this type is

$$y = y_0\left(1 - \frac{x^2}{a^2}\right)^{m}.$$
The formulas are

$$m = \frac{5\beta_2 - 9}{2(3 - \beta_2)},$$

$$a^2 = \frac{2\mu_2\beta_2}{3 - \beta_2},$$

$$y_0 = \frac{N}{a\,2^{2m+1}}\cdot\frac{\Gamma(2m + 2)}{\{\Gamma(m + 1)\}^2}.$$
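
The Type II constants may be checked by a short computation. In this Python sketch the formulas are taken to be $m = (5\beta_2 - 9)/\{2(3 - \beta_2)\}$, $a^2 = 2\mu_2\beta_2/(3 - \beta_2)$ and $y_0 = N\,\Gamma(2m+2)/[a\,2^{2m+1}\{\Gamma(m+1)\}^2]$. The function names, the numerical values and the mid-point summation used to verify the area are all assumptions made for illustration.

```python
import math


def type_two_constants(n, mu2, beta2):
    """m, a and y0 of the Type II curve y = y0 (1 - x^2/a^2)^m."""
    m = (5.0 * beta2 - 9.0) / (2.0 * (3.0 - beta2))
    a = math.sqrt(2.0 * mu2 * beta2 / (3.0 - beta2))
    log_y0 = (math.log(n) - math.log(a) - (2.0 * m + 1.0) * math.log(2.0)
              + math.lgamma(2.0 * m + 2.0) - 2.0 * math.lgamma(m + 1.0))
    return m, a, math.exp(log_y0)


def type_two_curve(x, m, a, y0):
    return y0 * (1.0 - (x / a) ** 2) ** m


def midpoint_area(curve, lower, upper, steps=20000):
    """Area by summing mid-ordinates of narrow strips (a check only)."""
    width = (upper - lower) / steps
    return width * sum(curve(lower + (i + 0.5) * width) for i in range(steps))


m, a, y0 = type_two_constants(1000.0, 4.0, 2.5)
area = midpoint_area(lambda x: type_two_curve(x, m, a, y0), -a, a)
print(m, a, y0, area)
```

The area under the fitted curve should return the total frequency N, and a value of $\beta_2$ below 1.8 gives a negative exponent.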
Type III. The equation is

$$y = y_0\,e^{-\gamma x}\left(1 + \frac{x}{a}\right)^{\gamma a}.$$

The formulas are,



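$$\gamma = \frac{2\mu_2}{\mu_3},$$

$$a = \frac{2\mu_2^2}{\mu_3} - \frac{\mu_3}{2\mu_2},$$

$$y_0 = \frac{N}{a}\cdot\frac{p^{p+1}}{e^{p}\,\Gamma(p + 1)}, \quad\text{where } p = \gamma a,$$

$$\text{Mode} = \text{mean} - \frac{\mu_3}{2\mu_2},$$

$$\text{Skewness} = \frac{1}{2}\sqrt{\beta_1}.$$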
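
A sketch of the Type III computation follows, in Python. It assumes the formulas $\gamma = 2\mu_2/\mu_3$, $a = 2\mu_2^2/\mu_3 - \mu_3/(2\mu_2)$, $p = \gamma a$ and $y_0 = N p^{p+1}/\{a\,e^{p}\,\Gamma(p+1)\}$, with $\mu_3$ positive and the origin at the mode. The function names and the numerical values are hypothetical.

```python
import math


def type_three_constants(n, mu2, mu3):
    """gamma, a, p and y0 of the Type III curve (mu3 assumed positive)."""
    gam = 2.0 * mu2 / mu3
    a = 2.0 * mu2 ** 2 / mu3 - mu3 / (2.0 * mu2)
    p = gam * a
    log_y0 = (math.log(n) - math.log(a) + (p + 1.0) * math.log(p)
              - p - math.lgamma(p + 1.0))
    return gam, a, p, math.exp(log_y0)


def type_three_curve(x, gam, a, y0):
    """Ordinate at x, the origin being at the mode."""
    return y0 * math.exp(-gam * x) * (1.0 + x / a) ** (gam * a)


def midpoint_area(curve, lower, upper, steps=20000):
    """Area by summing mid-ordinates of narrow strips (a check only)."""
    width = (upper - lower) / steps
    return width * sum(curve(lower + (i + 0.5) * width) for i in range(steps))


gam, a, p, y0 = type_three_constants(1.0, 5.0, 10.0)
area = midpoint_area(lambda x: type_three_curve(x, gam, a, y0), -a, 60.0)
print(gam, a, p, y0, area)
```

The curve starts at $x = -a$ and runs on indefinitely, so the upper limit of the check is simply taken far enough out for the remaining area to be negligible.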

Type IV. The equation is

$$y = y_0\left(1 + \frac{x^2}{a^2}\right)^{-m} e^{-\nu\tan^{-1}(x/a)}.$$
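The formulas are:

$$r = \frac{6(\beta_2 - \beta_1 - 1)}{2\beta_2 - 3\beta_1 - 6}, \qquad m = \frac{1}{2}(r + 2),$$

$$\epsilon = \frac{4r^2}{16(r - 1) - \beta_1(r - 2)^2},$$

$$\nu = \frac{1}{2}(r - 2)\sqrt{\beta_1\epsilon}, \quad\text{with sign opposite to that of } \mu_3,$$

$$a = \frac{1}{2}\,\frac{r\sigma}{\sqrt{\epsilon}},$$

$$y_0 = \frac{N}{a}\sqrt{\frac{r}{2\pi}}\;\frac{e^{\frac{\cos^2\phi}{3r} - \frac{1}{12r} - \nu\phi}}{(\cos\phi)^{r+1}}, \quad\text{where } \tan\phi = \frac{\nu}{r}.$$

$$\text{Origin} = \text{mean} + \frac{\nu a}{r}.$$

$$\text{Mode} = \text{mean} - \frac{1}{2}\,\frac{\mu_3(r - 2)}{\mu_2(r + 2)}.$$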

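Type V. The equation is

$$y = y_0\,x^{-p}\,e^{-\gamma/x}.$$

The formulas are:

$$p = 4 + \frac{8 + 4\sqrt{4 + \beta_1}}{\beta_1},$$

$$\gamma = (p - 2)\sqrt{\mu_2(p - 3)}, \quad\text{with sign same as that of } \mu_3,$$

$$y_0 = \frac{N\gamma^{p-1}}{\Gamma(p - 1)},$$

$$\text{sk.} = \frac{2\sqrt{p - 3}}{p},$$

$$\text{Origin} = \text{mean} - \frac{\gamma}{p - 2},$$

$$\text{Mode} = \text{mean} - \frac{2\gamma}{p(p - 2)}.$$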
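
The Type V constants lend themselves to a direct check. The Python sketch below assumes the formulas $p = 4 + \{8 + 4\sqrt{4 + \beta_1}\}/\beta_1$, $\gamma = (p - 2)\sqrt{\mu_2(p - 3)}$ with the sign of $\mu_3$, and $y_0 = N\gamma^{p-1}/\Gamma(p - 1)$. The function name and the numerical values are hypothetical.

```python
import math


def type_five_constants(n, mean, mu2, mu3, beta1):
    """p, gamma, y0, skewness, origin and mode of a Type V curve."""
    p = 4.0 + (8.0 + 4.0 * math.sqrt(4.0 + beta1)) / beta1
    gam = (p - 2.0) * math.sqrt(mu2 * (p - 3.0))
    if mu3 < 0.0:
        gam = -gam  # gamma takes the sign of mu3
    # y0 is formed through logarithms, using the magnitude of gamma.
    y0 = math.exp(math.log(n) + (p - 1.0) * math.log(abs(gam))
                  - math.lgamma(p - 1.0))
    return {"p": p, "gamma": gam, "y0": y0,
            "skewness": 2.0 * math.sqrt(p - 3.0) / p,
            "origin": mean - gam / (p - 2.0),
            "mode": mean - 2.0 * gam / (p * (p - 2.0))}


# Hypothetical values: mean 2, mu2 = 0.8, beta1 = 5, mu3 positive.
c = type_five_constants(1.0, 2.0, 0.8, math.sqrt(5.0) * 0.8 ** 1.5, 5.0)
print(c)
```

With these values the origin falls at zero and the mode at 1.5, as may be verified from the equation of the curve by setting its slope equal to zero.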





Type VI. The equation is

$$y = y_0\,(x - a)^{q_2}\,x^{-q_1}.$$

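The formulas are:

$$r = \frac{6(\beta_2 - \beta_1 - 1)}{6 + 3\beta_1 - 2\beta_2},$$

$$\epsilon = \frac{4r^2}{16(r + 1) + \beta_1(r + 2)^2},$$

$$1 - q_1 = \frac{r}{2} + \frac{r + 2}{4}\sqrt{\beta_1\epsilon},$$

$$1 + q_2 = \frac{r}{2} - \frac{r + 2}{4}\sqrt{\beta_1\epsilon},$$

$$a = \frac{r\sigma}{\sqrt{\epsilon}}, \quad\text{taken as a positive quantity},$$

$$y_0 = \frac{N a^{q_1 - q_2 - 1}\,\Gamma(q_1)}{\Gamma(q_1 - q_2 - 1)\,\Gamma(q_2 + 1)},$$

$$\text{Origin} = \text{mean} - \frac{a(q_1 - 1)}{q_1 - q_2 - 2},$$

$$\text{Mode} = \text{mean} - \frac{1}{2}\,\frac{\mu_3}{\mu_2}\,\frac{r + 2}{r - 2}.$$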
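
The Type VI constants may be sketched in Python as below. The formulas assumed are $r = 6(\beta_2 - \beta_1 - 1)/(6 + 3\beta_1 - 2\beta_2)$, $a = \frac{1}{2}\sqrt{\mu_2\{16(r+1) + \beta_1(r+2)^2\}}$, and $q_2$ and $-q_1$ as the greater and lesser of $\frac{1}{2}(r - 2) \pm \frac{1}{2}r(r+2)\sqrt{\beta_1/\{16(r+1) + \beta_1(r+2)^2\}}$. This is the form in which these constants are commonly stated, and the function name is hypothetical.

```python
import math


def type_six_constants(n, mu2, beta1, beta2):
    """r, a, q1, q2 and y0 of the Type VI curve y = y0 (x - a)^q2 x^(-q1)."""
    r = 6.0 * (beta2 - beta1 - 1.0) / (6.0 + 3.0 * beta1 - 2.0 * beta2)
    d = 16.0 * (r + 1.0) + beta1 * (r + 2.0) ** 2
    a = 0.5 * math.sqrt(mu2 * d)
    root = 0.5 * r * (r + 2.0) * math.sqrt(beta1 / d)
    lesser, greater = sorted(((r - 2.0) / 2.0 - root, (r - 2.0) / 2.0 + root))
    q1, q2 = -lesser, greater  # -q1 is the lesser root, q2 the greater
    # y0 is computed through logarithms to avoid very large gamma values.
    log_y0 = (math.log(n) + (q1 - q2 - 1.0) * math.log(a) + math.lgamma(q1)
              - math.lgamma(q1 - q2 - 1.0) - math.lgamma(q2 + 1.0))
    return {"r": r, "a": a, "q1": q1, "q2": q2, "y0": math.exp(log_y0)}
```

In this type $r$ is negative, so that $q_1$ exceeds $q_2$ and the curve runs from $x = a$ to infinity.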



Type VII. The equation is:

$$y = y_0\left(1 + \frac{x^2}{a^2}\right)^{-m}.$$



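The formulas are:

$$m = \frac{5\beta_2 - 9}{2(\beta_2 - 3)},$$

$$a^2 = \frac{2\mu_2\beta_2}{\beta_2 - 3},$$

$$y_0 = \frac{N}{a\sqrt{\pi}}\cdot\frac{\Gamma(m)}{\Gamma(m - \tfrac{1}{2})}.$$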
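
A short check of the Type VII constants is sketched below in Python. It assumes the formulas $m = (5\beta_2 - 9)/\{2(\beta_2 - 3)\}$, $a^2 = 2\mu_2\beta_2/(\beta_2 - 3)$ and $y_0 = N\,\Gamma(m)/\{a\sqrt{\pi}\,\Gamma(m - \frac{1}{2})\}$. The function names, the numerical values and the limits of the summation are hypothetical.

```python
import math


def type_seven_constants(n, mu2, beta2):
    """m, a and y0 of the Type VII curve y = y0 (1 + x^2/a^2)^(-m)."""
    m = (5.0 * beta2 - 9.0) / (2.0 * (beta2 - 3.0))
    a = math.sqrt(2.0 * mu2 * beta2 / (beta2 - 3.0))
    y0 = (n / (a * math.sqrt(math.pi))
          * math.exp(math.lgamma(m) - math.lgamma(m - 0.5)))
    return m, a, y0


def type_seven_curve(x, m, a, y0):
    return y0 * (1.0 + (x / a) ** 2) ** (-m)


def midpoint_area(curve, lower, upper, steps=40000):
    """Area by summing mid-ordinates of narrow strips (a check only)."""
    width = (upper - lower) / steps
    return width * sum(curve(lower + (i + 0.5) * width) for i in range(steps))


m, a, y0 = type_seven_constants(1.0, 1.4, 5.0)
area = midpoint_area(lambda x: type_seven_curve(x, m, a, y0), -200.0, 200.0)
print(m, a, y0, area)
```

The curve is unlimited in both directions, and the limits of the check are merely taken wide enough for the tails beyond them to be negligible.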

Normal Curve. The equation, as was proved in Chapter
VI, is

$$y = \frac{N}{\sigma\sqrt{2\pi}}\,e^{-\frac{x^2}{2\sigma^2}},$$

and the curve was discussed in that chapter.